===========================================================
problem with wake-one semantics and asynchronous exceptions
===========================================================

by Andrew Main (Zefram)
2005-09-14

abstract
--------

A large class of popular synchronisation algorithms is unsafe in the
presence of asynchronous exceptions.  This is attributable to the
semantics of condition variables of the type described by Hoare and now
very widely implemented.  It is difficult to write async-safe code using
these condition variables.  A modification of monitor semantics is
presented which yields a class of synchronisation primitives that makes
it much easier to write correct async-safe code.

table of contents
-----------------

0. problem with mutexes
1. problem with condition variables
2. satisfactory solution
3. generalisations
4. pseudocode implementation
5. references

0. problem with mutexes
=======================

A pure mutex is used to ensure that at most one process has access to a
shared resource at any time.  Processes must hold a lock on the mutex
while accessing the resource, and any attempt to lock the mutex while it
is already locked will block until the current lock holder releases it.

More formally, we can define a mutex thus: a mutex has two possible
types of state.  Firstly, it may be *unlocked*, in which case it has no
further state.  Alternatively it may be *locked*, in which case it has
exactly one *owning* process and zero or more *waiting* processes.

Consider the following obvious implementation, expressed in pseudocode.
Sections of code that must operate atomically are delimited by double
brackets [[thus]].

    struct mutex {
        owner: a process or NULL;
        waiters: a set of processes;
    }

    function make_mutex() {
        m := new mutex;
        m.owner := NULL;
        m.waiters := EMPTY_SET;
        return m;
    }

    function mutex_lock(m: mutex) {
        repeat indefinitely [[
            if m.owner == NULL, then {
                m.owner := current_process;
                return;
            } else {
                m.waiters := m.waiters + current_process;
                suspend current_process;  // resume at start of loop
            }
        ]]
    }

    function mutex_unlock(m: mutex) {
        [[
            if m.owner != current_process, then { error; }
            m.owner := NULL;
            if m.waiters != EMPTY_SET, then {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                wake p;
            }
        ]]
    }

A typical user of a mutex might look like this:

    struct pipe {
        mutex: a mutex;
        queue: a list of objects;
    }

    function pipe_write(p: pipe, datum: object) {
        mutex_lock(p.mutex);
        old_queue := p.queue;
        p.queue := append(old_queue, datum);
        mutex_unlock(p.mutex);
    }

Now consider what an asynchronous exception can do.  If an exception
occurs in the mutex owner, whatever operation the process was performing
on the shared resource is interrupted.  In the course of unwinding its
stack, the process must get the resource into a consistent state, and
then release the mutex to allow other processes access.  This might be
achieved by completing the planned operation as normal, but in many
environments that might not be an option.  It also might not be desired,
depending on the nature of the exception and the operation.

The first part of the cleanup (to get the resource into a consistent
state) can often be arranged to be a null action, by careful arrangement
of operations.  In the pipe example, the pipe state is always
consistent, provided that the replacement of p.queue by its new value is
atomic.  If pipe_write() is interrupted then the datum may or may not
have been added to the queue, but either way it is a valid queue.
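
To make this concrete, here is a minimal sketch in Python; the Pipe
class and the names used are purely illustrative, not part of the
pseudocode.  The replacement queue is built in full before being
installed, so the critical update is a single reference assignment:

    import threading

    class Pipe:
        # Illustrative counterpart of the pseudocode pipe.
        def __init__(self):
            self.mutex = threading.Lock()
            self.queue = []                       # always a complete, valid list

    def pipe_write(p, datum):
        p.mutex.acquire()
        old_queue = p.queue
        new_queue = old_queue + [datum]           # build the replacement in full...
        p.queue = new_queue                       # ...then install it in one step
        p.mutex.release()

If this pipe_write() is interrupted before the assignment, the old queue
remains in place; after it, the new queue is in place; either way the
shared state is a valid list.  Releasing the lock on such an
interruption is the part addressed next.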

The second part of cleanup, releasing the lock, is also simple, provided
that the language gives us control over unwinding:

    function pipe_write(p: pipe, datum: object) {
        try {
            mutex_lock(p.mutex);
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
        } and on unwinding do {
            if p.mutex.owner == current_process, then {
                mutex_unlock(p.mutex);
            }
        }
    }

The combined operation of acquiring a lock and preparing to release it
on unwinding (whether exceptional or not) is pre-packaged in some
languages.  We can rewrite pipe_write() more concisely using such a
feature:

    function pipe_write(p: pipe, datum: object) {
        with mutex p.mutex, do {
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
        }
    }

So far so good.  But we have not exhausted the possible places where an
exception could strike.  Consider a mutex locked by process A, with
processes B and C waiting to lock it.  A is executing its critical
section, while B and C are suspended inside mutex_lock().  A completes
its critical section and unlocks the mutex, waking B.  Before B actually
runs, however, it receives an exception.  Now when B actually runs it
will drop out of mutex_lock() without attempting to lock the mutex.  We
are left with the mutex unlocked, but C waiting for it!  The only way to
unwedge C is for another process to lock and then unlock the mutex.
This is a bug.

The bug actually occurred some way back.  When A executed
mutex_unlock(), it set m.owner to NULL, and removed B from m.waiters,
leaving C.  It left the mutex marked as unlocked (m.owner == NULL) but
with a non-empty wait set.  That violates our definition of the mutex,
which says that an unlocked mutex must have no other state.  B
abandoning its efforts at this point was merely the icing on the cake,
revealing the bug by leaving the mutex in its illegal state long enough
for us to notice.  The code only appeared to work before because, when
putting the mutex into this state, it also woke up a process that
intended to rectify it.

There are three solutions to this bug.  The first is to explicitly allow
a mutex to have this third type of state, where it has no owner, one or
more waiters, and a live process intending imminently to own it.  In
that case the woken waiter has a responsibility to fulfil if it receives
an exception:

    function mutex_lock(m: mutex) {
        need_to_wake := FALSE;
        try {
            repeat indefinitely [[
                if m.owner == NULL, then {
                    m.owner := current_process;
                    need_to_wake := FALSE;
                    return;
                } else {
                    m.waiters := m.waiters + current_process;
                    need_to_wake := TRUE;
                    suspend current_process;  // resume at start of loop
                }
            ]]
        } and on unwinding do [[
            if need_to_wake and m.waiters != EMPTY_SET, then {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                wake p;
            }
        ]]
    }

Thus if the woken prospective locker is interrupted, it hands off its
responsibility to another prospective locker.

The second solution is for the unlocker to wake *all* waiters, rather
than just one.  This can be viewed as a variation on the first solution,
with woken waiters not having to hand off their locking responsibility
because all possible recipients of that responsibility are already
executing it.  From another point of view, though, it solves the problem
by removing the need for the third type of mutex state: the mutex is
never unlocked with suspended waiters.  It thus satisfies both analyses,
but in practice satisfies no one, because it leads to a thundering herd.
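
For comparison, here is a toy sketch of the wake-everyone strategy in
Python.  The WakeAllMutex class is invented for illustration, and a
threading.Condition is used only to obtain suspend and wake; the point
illustrated is solely that unlock() never leaves the mutex unlocked
while waiters stay asleep, at the cost of waking every waiter:

    import threading

    class WakeAllMutex:
        # Toy mutex whose unlock wakes every waiter (the second solution).
        def __init__(self):
            self._cv = threading.Condition()
            self._owner = None

        def lock(self):
            with self._cv:
                while self._owner is not None:
                    self._cv.wait()                  # suspended waiter
                self._owner = threading.current_thread()

        def unlock(self):
            with self._cv:
                if self._owner is not threading.current_thread():
                    raise RuntimeError("not the owner")
                self._owner = None
                self._cv.notify_all()                # wake *all* waiters; they race to relock

A waiter that is woken and then interrupted has nothing to hand off,
because every other waiter has been woken too and will retry the lock on
its own; the cost is that all of them wake and contend.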

The third approach is to stick with the original two-state-type
definition, and modify the unlocker to maintain the invariant properly:

    function mutex_lock(m: mutex) {
        [[
            if m.owner == NULL, then {
                m.owner := current_process;
            } else {
                m.waiters := m.waiters + current_process;
                suspend current_process;  // resume by returning
            }
        ]]
    }

    function mutex_unlock(m: mutex) {
        [[
            if m.owner != current_process, then { error; }
            if m.waiters == EMPTY_SET, then {
                m.owner := NULL;
            } else {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                m.owner := p;
                wake p;
            }
        ]]
    }

Here the mutex ownership is handed off directly to a waiter.  If the
woken waiter gets an exception immediately, it now owns the mutex and so
knows to release it.  The mutex invariant is maintained, and (as a
bonus) a process Z, which never holds the mutex but occasionally looks
at its state, can never see it apparently idle (unlocked) when there are
actually processes waiting for it.

(An asynchronous exception might also occur to a waiting process while
it is suspended.  If such an exception is to prematurely terminate the
suspension then it must remove the process from the wait set.  This
detail is the same in all versions of the code above, and causes no
further trouble.)

1. problem with condition variables
===================================

The previous section has shown a problem in the implementation of
mutexes and how to overcome it.  This section will examine a similar
problem in the use of condition variables.

A condition variable is essentially a group of processes waiting for
some condition to occur.  Processes can join and leave the group,
normally leaving when the condition becomes true.  A condition variable
is always used with a mutex; a mutex plus zero or more condition
variables forms the infrastructure of a monitor, as described in
[MONITORS].

There is a function that atomically joins a condition group and unlocks
the mutex.  A process holds the mutex while initially checking for the
condition of interest, and if the condition is false then holding the
mutex prevents it becoming true, so the atomic wait-and-unlock operation
ensures that a process can't miss the condition becoming true.
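
This check-under-the-lock-then-wait pattern is what mainstream thread
libraries package up.  A minimal sketch in Python follows; the Buffer
class and its names are illustrative.  threading.Condition bundles the
mutex with the wait group, and wait() atomically releases the lock and
joins the group, so a notification cannot slip between the check and the
wait.  (Unlike the condition_wait_and_unlock() defined below, wait()
also relocks before returning; that combined operation is discussed
further on.)

    import threading

    class Buffer:
        def __init__(self):
            self.lock = threading.Lock()
            self.nonempty = threading.Condition(self.lock)   # condition tied to the mutex
            self.items = []

        def get(self):
            with self.nonempty:              # lock the mutex
                while not self.items:        # check the condition under the lock
                    self.nonempty.wait()     # atomically unlock and join the wait group
                return self.items.pop(0)

        def put(self, datum):
            with self.nonempty:
                self.items.append(datum)
                self.nonempty.notify()       # wake one waiter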

Here's a specific implementation, assuming the direct-handoff version of
the mutex code from the previous section:

    struct condition {
        waiters: a set of processes;
    }

    function make_condition() {
        c := new condition;
        c.waiters := EMPTY_SET;
        return c;
    }

    function condition_wait_and_unlock(c: condition, m: mutex) {
        [[
            if m.owner != current_process, then { error; }
            c.waiters := c.waiters + current_process;
            if m.waiters == EMPTY_SET, then {
                m.owner := NULL;
            } else {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                m.owner := p;
                wake p;
            }
            suspend current_process;  // resume by returning
        ]]
    }

    function condition_notify_one(c: condition) {
        [[
            if c.waiters != EMPTY_SET, then {
                p := any process in c.waiters;
                c.waiters := c.waiters - p;
                wake p;
            }
        ]]
    }

A typical use then looks like this:

    struct pipe {
        mutex: a mutex;
        nonempty: a condition;
        queue: a list of objects;
    }

    function pipe_write(p: pipe, datum: object) {
        with mutex p.mutex, do {
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
            if old_queue == EMPTY_LIST, then {
                condition_notify_one(p.nonempty);
            }
        }
    }

    function pipe_read(p: pipe) {
        try {
            repeat indefinitely {
                mutex_lock(p.mutex);
                if p.queue != EMPTY_LIST, then {
                    exit loop;
                } else {
                    condition_wait_and_unlock(p.nonempty, p.mutex);
                }
            }
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            if new_queue != EMPTY_LIST, then {
                condition_notify_one(p.nonempty);
            }
            return datum;
        } and on unwinding do {
            if p.mutex.owner == current_process, then {
                mutex_unlock(p.mutex);
            }
        }
    }

Again, consider asynchronous exceptions.  The interesting case is where
processes B and C are waiting, in the condition variable, for the pipe
to be non-empty so that they can read it, and then A writes to the pipe
making it non-empty.  A signals the "nonempty" condition, waking B and
leaving C in the condition's wait set.  B then receives an exception.
B's unwinding will release the mutex if it got it, but is likely to
leave the pipe non-empty with C still suspended waiting for
non-emptiness.

This situation is precisely analogous to the situation with just the
mutex.  The first part of pipe_read() is just like the initial faulty
implementation of mutex_lock().  The problem with the mutex can be
analysed as a simple case of this problem with condition variables.

Some of the same solutions are available.  The first type of solution,
requiring the interrupted process to clean up, is more complicated here.
pipe_read() must be modified to call condition_notify_one() if it is
interrupted any time between calling condition_wait_and_unlock() and
normal termination.  It may or may not hold the mutex when it does this;
the latter may be a problem for some implementations of conditions,
because access to condition variables is usually controlled by the
mutex.  There will also be spurious wakes under this solution, a minor
annoyance.

The second approach, of waking all waiters rather than just one, is
easily applied.  Implementations of condition variables invariably
supply a function that wakes all waiters:

    function condition_broadcast(c: condition) {
        [[
            old_waiters := c.waiters;
            c.waiters := EMPTY_SET;
            wake every process in old_waiters;
        ]]
    }

Calling condition_broadcast() wherever pipe_write() and pipe_read() call
condition_notify_one() solves the problem, but only by trampling it
under a thundering herd.

The third and most satisfactory technique, direct handoff, turns out
very different in this case, and is explored in the next section.
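
To make the first type of solution concrete, here is a rough sketch in
Python; the Pipe class and its fields are illustrative.  Python's
Condition.wait() reacquires the lock even when it raises, so the handler
below always holds the mutex when it re-signals, sidestepping the
complication noted above.  Whether an asynchronous exception can
actually reach a waiting thread depends on the environment, so treat
this only as a sketch of the cleanup obligation:

    import threading
    from collections import deque

    class Pipe:
        def __init__(self):
            self.lock = threading.Lock()
            self.nonempty = threading.Condition(self.lock)
            self.queue = deque()

        def read(self):
            with self.nonempty:
                try:
                    while not self.queue:
                        self.nonempty.wait()
                    datum = self.queue.popleft()
                    if self.queue:
                        self.nonempty.notify()   # queue still non-empty: pass the wakeup on
                    return datum
                except BaseException:
                    # Interrupted somewhere between being woken and finishing:
                    # hand the wakeup on, so another reader is not left asleep
                    # while the pipe is non-empty.
                    if self.queue:
                        self.nonempty.notify()
                    raise

        def write(self, datum):
            with self.nonempty:
                self.queue.append(datum)
                self.nonempty.notify()

The re-notification in the exception path can produce spurious wakes,
the minor annoyance mentioned above, but it never leaves a waiter asleep
while the pipe is non-empty.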

Note that the condition_wait_and_unlock() function presented here does
not automatically relock the mutex after the process is woken.  More
usually the relocking is packaged up as part of the wait operation,
[SRFI-18] being a notable exception.  Such combined operations make the
code more concise, and may (depending on implementation) resolve the
case where the woken process is interrupted before regaining the mutex.
They do not solve the general problem, but we will henceforth use this
approach.  condition_wait_and_relock() shall combine
condition_wait_and_unlock() with mutex_lock(), and the naive pipe_read()
now looks like this:

    function pipe_read(p: pipe) {
        with mutex p.mutex, do {
            while p.queue == EMPTY_LIST, do {
                condition_wait_and_relock(p.nonempty, p.mutex);
            }
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            if new_queue != EMPTY_LIST, then {
                condition_notify_one(p.nonempty);
            }
            return datum;
        }
    }

2. satisfactory solution
========================

Observe that if pipe_read() is required to clean up the monitor state
itself then it usually has to signal the condition.  It must signal the
condition if it is interrupted while it does not have the mutex; also if
it is interrupted before completing its work; and also if the condition
remains true after its work is complete.  The only situation in which it
does not signal the condition is if it got the mutex, completed its
work, and the condition became false as a result.

This suggests that signalling the condition should be the default
behaviour, rather than explicitly invoked.  Ignoring implementation for
a moment, we want to write in this style:

    function pipe_read(p: pipe) {
        with mutex p.mutex, do {
            while p.queue == EMPTY_LIST, do {
                condition_wait_and_relock(p.nonempty, p.mutex);
            }
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            if new_queue == EMPTY_LIST, then {
                not going to signal p.nonempty;
            }
            return datum;
        }
    }

In this version of the code, condition_notify_one(p.nonempty) is to be
called implicitly when the mutex is unlocked, unless the "not going
to ..." statement was reached.  condition_notify_one() is never called
explicitly.  To implement this, the mutex would have to know about the
condition variable, and also maintain a record of whether the condition
is to be signalled when the mutex is unlocked.  The former it can pick
up in the condition_wait_and_relock() call, and the latter is trivial
given the "not going to ..." statement.

But wait.  What if there's more than one condition variable?  In that
case we might need to signal more than one condition on unwinding, and
the correct default behaviour (of whether to signal or not) might vary
between conditions and depend on other state.  When locking the mutex
we'd have to say which conditions to implicitly signal on unlock, and
then modify that list as we mutate the state.  But we can't get this
list completely right upon locking, because in the general case it might
depend on state that we can't see without holding the lock.  It would be
necessary to list every possible condition that the function is
concerned with and then modify the list once the lock is held.

Alternatively, the mutex lock operation could implicitly arrange for
*every* condition attached to the mutex to be signalled when the mutex
is unlocked.  The conditions then wouldn't need to be explicitly listed
in the locking function call.

However, functions would then have to turn off implicit signalling of
any conditions they're not concerned with at all.  This forces them to
know about other parts of the monitor that they otherwise could have
ignored.  This is most unsatisfactory.

The clean solution is that the default signalling behaviour for each
condition is *whatever it was last time*.  More precisely, each
condition variable has an attached Boolean flag indicating whether its
condition is presently true; this flag is maintained by the code that
alters the state that the condition is concerned with.  A mutex must
know about all its associated conditions, and whenever the mutex is
unlocked it calls condition_notify_one() on each condition whose flag is
true.  More subtly, it can examine each condition's queue, and wake just
a single process that is waiting on *any* true condition.  Waking a
waiting process also hands mutex ownership to it directly, rather than
requiring it to regain the mutex itself.  In this system, if a woken
process achieves nothing before being interrupted, its unlocking of the
mutex will cause the next waiting process to be woken using the same
criteria that were used to wake it.

[MONITORS], while advocating a direct handoff approach, explicitly
points out that its condition variables have no Boolean state of the
type developed here: "There is a great temptation to introduce a more
complex synchronization primitive ...  We shall resist this temptation
for a while."  Thirty-one years of resistance seems sufficient.

3. generalisations
==================

The development in the course of the preceding section has led to a
slightly more complex notion of a monitor than was given in [MONITORS].
As before, a monitor consists of a mutex plus zero or more condition
variables.  The monitor knows about all of its condition variables,
which was not required before.  Each condition variable now consists of
a set of waiting processes and a Boolean flag.  The flag in each
condition variable indicates whether its waiting processes are eligible
to be woken.  It is meaningful to add a condition variable to an
existing monitor, or to drop an existing one.  This was always possible
with separate condition variables, but with the new type of monitor the
monitor needs to be told that it is being done.

The mutex itself has a set of waiting processes, and processes in that
set are always eligible to be woken (to be granted the lock).  We can
generalise by detaching the wait set from the mutex, and instead
treating it as an always-true condition variable.  The only remaining
independent part of the mutex is the ownership record.  A process
entering its monitored section while the monitor is locked might start
by entering this always-true condition variable, as it previously
started by entering the mutex's wait set.  But it doesn't have to: it
can now equally well start with a non-trivial condition variable,
entering the condition's wait set directly if the condition is false or
the monitor locked.  The pipe example would be more concise this way.

We can now view a plain mutex as a degenerate case of the new type of
monitor: it has exactly one condition variable, with a condition that is
always true.

It's not necessary to have a condition variable with an always-true
condition.  The limited-length pipe example in [SRFI-18], implemented
there with two condition variables (for the conditions "pipe not empty"
and "pipe not full"), works as a new-style monitor with just those two
condition variables.
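
As a rough illustration of how such a monitor might look in practice,
here is a sketch in Python.  The FlaggedMonitor and FlaggedCondition
classes are invented for this sketch: each condition carries the Boolean
flag described above, maintained by the state-changing code, and leaving
the monitor implicitly notifies a waiter of each condition whose flag is
true.  threading.Condition is used only for the underlying suspend and
wake, and ownership is not handed off directly, so this is a
simplification of the scheme rather than a faithful rendering of the
pseudocode in section 4:

    import threading

    class FlaggedCondition:
        # A condition variable with an attached "currently true" flag.
        def __init__(self, lock):
            self.cv = threading.Condition(lock)
            self.flag = False

    class FlaggedMonitor:
        # A monitor that knows its conditions; unlocking implicitly signals
        # every condition whose flag is true.
        def __init__(self):
            self.lock = threading.Lock()
            self.conditions = []

        def new_condition(self):
            c = FlaggedCondition(self.lock)
            self.conditions.append(c)
            return c

        def __enter__(self):
            self.lock.acquire()
            return self

        def __exit__(self, *exc_info):
            for c in self.conditions:
                if c.flag:
                    c.cv.notify()        # implicit wake of one waiter per true condition
            self.lock.release()

        def wait_for(self, c):
            # Caller must hold the monitor; returns with c.flag true.
            while not c.flag:
                c.cv.wait()

    class BoundedPipe:
        # Modelled on the two-condition limited-length pipe just described.
        MAX = 8

        def __init__(self):
            self.m = FlaggedMonitor()
            self.nonempty = self.m.new_condition()   # flag: queue is not empty
            self.nonfull = self.m.new_condition()    # flag: queue is not full
            self.nonfull.flag = True
            self.queue = []

        def write(self, datum):
            with self.m:
                self.m.wait_for(self.nonfull)
                self.queue.append(datum)
                self.nonempty.flag = True
                self.nonfull.flag = len(self.queue) < self.MAX
            # leaving the with-block signals whichever conditions are flagged true

        def read(self):
            with self.m:
                self.m.wait_for(self.nonempty)
                datum = self.queue.pop(0)
                self.nonfull.flag = True
                self.nonempty.flag = bool(self.queue)
                return datum

If a woken process is interrupted before it accomplishes anything, its
exit from the with-block re-notifies any condition that is still flagged
true, which is the async-safety property developed above.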

It is impossible for both of those conditions to be false
simultaneously, so there is always some way to make progress.
Hypothetically, if all conditions in a monitor were to become false
simultaneously and the lock were then released, it would become
impossible to enter a monitored section, unless a condition can be set
to true without holding the lock.

A process might want to wait for two or more known conditions to be true
simultaneously.  This can be trivially implemented by adding another
condition variable for the conjunction of the other conditions.  More
interestingly, the conjunction can be synthesised.  When the process
wishes to wait for several conditions simultaneously, it first waits for
any one of the conditions.  Now holding the lock, it checks all the
remaining conditions.  If all are true, it has succeeded.  If any are
false, then it waits for any one of the false conditions, and then
repeats the check.

There is a variation that avoids the need for code to manually maintain
the Boolean flags on the condition variables.  Instead of having a flag,
each condition variable has a thunk (a piece of code) which can be
called to determine whether the condition is currently true.  Upon mutex
unlock, rather than examining the Boolean flags, the thunks are called
to determine which wait sets are eligible to be woken.  The thunk for a
condition is also checked when a process wants to enter a monitored
section with that condition.  This variant doesn't cope with conditions
being changed by outside code while the lock is not held, but that could
be rectified by adding a procedure to recheck conditions.

Another variation in representing conditions is to merge all the
condition variables into a single wait set.  This time, each process in
the wait set has an associated thunk, just as each condition variable
had in the previous variation.  Note that the monitor's data structure
now resembles the simple mutex that we started with, except for the
addition of the thunks.  Each process's thunk checks for whatever
condition the process requires in order for it to be woken.  When the
monitor is unlocked, the unlock code calls the thunks of the waiting
processes, and grants the lock to a process whose thunk returns true.
This is extremely flexible, because wake conditions don't have to be
known in advance, but is needlessly expensive if many waiting processes
have the same wake condition.

Having all the monitor's wait queues accessible to the lock code also
helps to solve a priority inversion problem that can occur with
traditional condition variables.  Suppose process A, high priority, is
waiting on a condition, and process C, low priority, holds the
associated lock.  In a traditional monitor it may be impossible to
detect the dependency, because A isn't in the mutex's wait set, and so a
medium-priority process B could preempt C, and so indirectly A.  In the
type of monitor described in this paper, A is in a wait set very clearly
connected with C's lock, so priority inheritance may be applied.  This
doesn't solve the case where the condition is false and the monitor
unlocked, however.

4. pseudocode implementation
============================

This section contains a pseudocode implementation of the most general
form of monitor described in the previous section.  This form has the
disadvantage of requiring arbitrary predicates to be evaluated in atomic
sections, making it impractical to implement in many contexts.
Nevertheless, it is conceptually the cleanest, and forms a useful
theoretical model when considering the more restricted forms.

    struct waiter {
        process: a process;
        condition: a thunk;
    }

    struct conditional_mutex {
        owner: a process or NULL;
        waiters: a set of waiters;
    }

    function make_conditional_mutex() {
        cm := new conditional_mutex;
        cm.owner := NULL;
        cm.waiters := EMPTY_SET;
        return cm;
    }

    function conditional_mutex_lock(cm: conditional_mutex, cond: thunk) {
        [[
            if cm.owner == NULL and cond(), then {
                cm.owner := current_process;
            } else {
                wtr := new waiter;
                wtr.process := current_process;
                wtr.condition := cond;
                cm.waiters := cm.waiters + wtr;
                suspend current_process;  // resume by returning
            }
        ]]
    }

    function conditional_mutex_unlock(cm: conditional_mutex) {
        [[
            if cm.owner != current_process, then { error; }
            for each wtr in cm.waiters, do {
                if wtr.condition(), then {
                    cm.waiters := cm.waiters - wtr;
                    cm.owner := wtr.process;
                    wake wtr.process;
                    discard wtr;
                    return;  // ignore all remaining waiters
                }
            }
            // no runnable waiters
            cm.owner := NULL;
        ]]
    }

    function conditional_mutex_relock(cm: conditional_mutex, cond: thunk) {
        [[
            if cm.owner != current_process, then { error; }
            if not cond(), then {
                wtr := new waiter;
                wtr.process := current_process;
                wtr.condition := cond;
                cm.waiters := cm.waiters + wtr;
                conditional_mutex_unlock(cm);
                suspend current_process;  // resume by returning
            }
        ]]
    }

    // to call when a condition might have become true asynchronously
    function conditional_mutex_recheck(cm: conditional_mutex) {
        [[
            if cm.owner == NULL, then {
                for each wtr in cm.waiters, do {
                    if wtr.condition(), then {
                        cm.waiters := cm.waiters - wtr;
                        cm.owner := wtr.process;
                        wake wtr.process;
                        discard wtr;
                        return;  // ignore all remaining waiters
                    }
                }
            }
        ]]
    }

Because automatically releasing a mutex on unwinding is such a common
requirement, the shorthand syntax

    with conditional_mutex <cm> when <cond>, do {
        <body>
    }

shall stand for the full form

    try {
        conditional_mutex_lock(<cm>, { <cond> });
        <body>
    } and on unwinding do {
        if <cm>.owner == current_process, then {
            conditional_mutex_unlock(<cm>);
        }
    }

For demonstration purposes, here is an async-safe implementation of a
limited-length pipe object:

    function pipe_read(p: pipe) {
        with conditional_mutex p.condmutex when p.queue != EMPTY_LIST, do {
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            return datum;
        }
    }

    function pipe_write(p: pipe, datum: object) {
        with conditional_mutex p.condmutex when length(p.queue) != MAX_PIPE_LENGTH, do {
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
        }
    }

5. references
=============

[MONITORS] C. A. R. Hoare, "Monitors: An Operating System Structuring
Concept", Communications of the ACM, Vol. 17, No. 10, October 1974,
pp. 549-557.

[SRFI-18] Marc Feeley, "SRFI 18: Multithreading support", 2001-03-14.