page-fault

The Concurrency API

Prefer task-based programming to thread-based

If you want to run some work asynchronously, you have two basic choices: std::thread and std::async:

auto doWork = [] { /* ... */ }; // some async task
std::thread t { doWork };
auto fut = std::async (doWork);

The async approach is often better, for a few reasons:

The OS only has a limited number of threads it can provide. If more threads are requested than the system can provide, a system_error exception will be thrown, even if the task itself is noexcept.

If you create more software threads than there are available hardware threads, the OS will be forced to time-slice the threads, introducing potentially costly context switches. Avoiding this is difficult because different systems will have different hardware and workloads.

Using std::async allows the system to schedule work using a system-wide thread pool, improving load balancing across hardware threads (although this is not required). Using thread directly will require you to deal with thread exhaustion and load balancing yourself, and your solution may not interact well with similar systems in other software.

There are a few situations in which using thread directly makes sense:

- You need access to the underlying platform threading API (e.g. via native_handle) for features the C++ API doesn't expose, such as priorities or affinities.
- You need (and are able) to optimise thread usage for your particular application and hardware.
- You're implementing threading technology beyond what the C++ API offers, e.g. a thread pool on a platform that lacks one.


Specify std::launch::async if asynchronicity is essential

There are two launch policies:

- std::launch::async: the task must run asynchronously, on a different thread.
- std::launch::deferred: the task runs synchronously, and only when get or wait is called on the returned future; if neither is ever called, the task never runs.

If no launch policy is provided, the implementation is free to choose whether to run the task async or deferred. This has some interesting implications: We don’t know when and where the task will run, or even if it will run at all (if get isn’t called on some paths).

Using the default launch policy is only acceptable under the following circumstances:

- The task doesn't need to run concurrently with the thread calling get or wait.
- It doesn't matter which thread's thread-local storage the task reads or writes.
- Either there's a guarantee that get or wait will be called on the returned future, or it's acceptable that the task might never execute.
- Any code using wait_for or wait_until takes the possibility of deferred status into account.

If any of those conditions cannot be met, you should explicitly request the async launch policy.

You can fairly trivially write a reallyAsync wrapper for std::async:

template <typename... Ts>
auto reallyAsync (Ts&&... ts) {
  return std::async (std::launch::async, std::forward<Ts> (ts)...);
}


Make std::threads unjoinable on all paths

thread objects can be in either a ‘joinable’ or an ‘unjoinable’ state.

A ‘joinable’ thread corresponds to an underlying thread of execution that is running or waiting to be scheduled.

An ‘unjoinable’ thread might be:

- default-constructed (there is no function to execute)
- moved from (the underlying thread of execution now belongs to a different thread object)
- already joined
- detached

If the destructor of a thread in a joinable state is invoked, the program will be terminated (via std::terminate). This sounds bad, but the alternatives are worse:

- An implicit join would make the destructor block until the underlying thread finishes, leading to surprising performance anomalies and hard-to-debug hangs.
- An implicit detach would let the underlying thread keep running after the thread object is destroyed; if the task captures locals by reference, it could then write to stack memory that has since been reused by other code.

A sudden program termination should be easier to debug than either of the other two cases.

To guarantee that a thread is made unjoinable on every path, you can write a straightforward RAII wrapper for thread that joins/detaches in its destructor. Be careful using this though, for the reasons mentioned earlier.

There’s no direct support in the C++ stdlib for ‘interruptible threads’, but they can be implemented by hand using the provided primitives.


Be aware of varying thread handle destructor behaviour

The result of a std::async call cannot be stored in an object associated with the callee, as the callee might finish before get is called on the future. In this scenario, objects local to the callee would be destroyed before they could be read.

Similarly, the result cannot be stored directly in the associated future, because this object might be moved or converted to a shared_future, so there wouldn’t be a constant target address to which to write the result.

Instead, there’s typically a shared object stored on the heap known as the ‘Shared State’ which is referenced by both the promise that sends the result and the future which receives it.

The behaviour of a future’s destructor is determined by this shared state:

- Normally, the destructor just destroys the future’s member data and decrements the reference count on the shared state.
- However, the destructor of the last future referring to the shared state of a non-deferred task launched via std::async blocks until the task completes (an implicit join on the thread running the task).

That is, futures returned from std::async are a special case with different destructor behaviour. This is important to know when creating futures via other mechanisms, like promise or packaged_task, whose futures never block on destruction.


Consider void futures for one-shot event communication

There are a few ways to signal events between threads. One option might be to use a std::condition_variable:

// shared state
std::condition_variable cv;
std::mutex m;

// event source
cv.notify_one();

// event receiver
{
  std::unique_lock lk { m }; // unique_lock supports CTAD
  cv.wait (lk);
}

This approach has some issues:

- A mutex is required even when there’s no shared data to protect, which is a code smell.
- If notify_one is called before the receiver reaches wait, the notification is lost and the receiver blocks forever.
- The wait is vulnerable to spurious wakeups: without a predicate, the receiver can wake even though no notification occurred.

Another option is to use a boolean flag (made atomic or protected by a mutex): set it from the signalling thread and poll it from the waiting thread. But the polling wastes CPU time.

The two approaches can be combined (i.e. set a bool flag and then notify the condvar, check the bool flag after waking) but that’s a lot of boilerplatey code that’s awkward to reuse cleanly.

A much cleaner approach is to use a std::promise<void>:

// shared state
std::promise<void> p;

// event source
p.set_value();

// event receiver
p.get_future().wait();

This approach is more concise, works even if set_value is called before wait, and is immune to spurious wakeups. However, it requires a heap allocation for the shared state, and a promise/future pair is single-shot: it can signal only one event and can’t be reset for reuse.

When combined with shared futures, this pattern can be used to send notifications to multiple threads at once:

std::promise<void> p;
auto sf = p.get_future().share();

std::vector<std::thread> vt;

for (auto i = 0; i != 2; ++i)
  vt.emplace_back ([sf] { sf.wait(); });

p.set_value(); // all threads waiting on the shared future will be notified

for (auto& t : vt)
  t.join();


Use std::atomic for concurrency, volatile for special memory

atomic and volatile have different meanings. Operations on atomic<T> objects are guaranteed to be seen as atomic by other threads; volatile gives no such guarantee, so simultaneous readers/writers of volatile storage still constitute a data race (and hence undefined behaviour).

Using atomics (with the default memory order, std::memory_order_seq_cst) also ensures sequential consistency, whereas volatile does not: the compiler and hardware remain free to reorder surrounding operations, so the optimiser might do surprising things with code that just uses volatile.

So what is volatile actually good for? In normal code, the optimiser is allowed to remove redundant loads and dead stores (series of non-interleaved reads or writes to an address) because such code wouldn’t have any useful side-effects. However, for some types of special memory (e.g. memory-mapped IO), addresses might be read/written by peripherals, so these chains of reads/writes would no longer be redundant: reading the same address twice in a row might yield different results, and setting the same address twice in a row might adjust settings on some peripheral.

volatile just tells the compiler not to remove these ‘redundant’ loads and stores, and not to reorder them relative to other volatile accesses.
