===========================================================
problem with wake-one semantics and asynchronous exceptions
===========================================================

by Andrew Main (Zefram)
2005-09-14

abstract
--------

A large class of popular synchronisation algorithms is unsafe in the
presence of asynchronous exceptions.  This is attributable to the
semantics of condition variables of the type described by Hoare and now
very widely implemented.  It is difficult to write async-safe code using
these condition variables.  A modification of monitor semantics is
presented which yields a class of synchronisation primitives that makes
it much easier to write correct async-safe code.

table of contents
-----------------

0. problem with mutexes
1. problem with condition variables
2. satisfactory solution
3. generalisations
4. pseudocode implementation
5. references

0. problem with mutexes
=======================

A pure mutex is used to ensure that at most one process has access to a
shared resource at any time.  Processes must hold a lock on the mutex
while accessing the resource, and any attempt to lock the mutex while it
is already locked will block until the current lock holder releases it.

More formally, we can define a mutex thus: a mutex has two possible
types of state.  Firstly, it may be *unlocked*, in which case it has no
further state.  Alternatively it may be *locked*, in which case it has
exactly one *owning* process and zero or more *waiting* processes.

Consider the following obvious implementation, expressed in pseudocode.
Sections of code that must operate atomically are delimited by double
brackets [[thus]].

    struct mutex {
        owner: a process or NULL;
        waiters: a set of processes;
    }

    function make_mutex() {
        m := new mutex;
        m.owner := NULL;
        m.waiters := EMPTY_SET;
        return m;
    }

    function mutex_lock(m: mutex) {
        repeat indefinitely [[
            if m.owner == NULL, then {
                m.owner := current_process;
                return;
            } else {
                m.waiters := m.waiters + current_process;
                suspend current_process;  // resume at start of loop
            }
        ]]
    }

    function mutex_unlock(m: mutex) {
        [[
            if m.owner != current_process, then { error; }
            m.owner := NULL;
            if m.waiters != EMPTY_SET, then {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                wake p;
            }
        ]]
    }

A typical user of a mutex might look like this:

    struct pipe {
        mutex: a mutex;
        queue: a list of objects;
    }

    function pipe_write(p: pipe, datum: object) {
        mutex_lock(p.mutex);
        old_queue := p.queue;
        p.queue := append(old_queue, datum);
        mutex_unlock(p.mutex);
    }

Now consider what an asynchronous exception can do.  If an exception
occurs in the mutex owner, whatever operation the process was performing
on the shared resource is interrupted.  In the course of unwinding its
stack, the process must get the resource into a consistent state, and
then release the mutex to allow other processes access.  This might be
achieved by completing the planned operation as normal, but in many
environments that might not be an option.  It also might not be desired,
depending on the nature of the exception and the operation.

The first part of the cleanup (to get the resource into a consistent
state) can often be arranged to be a null action, by careful arrangement
of operations.  In the pipe example, the pipe state is always
consistent, provided that the replacement of p.queue by its new value is
atomic.  If pipe_write() is interrupted then the datum may or may not
have been added to the queue, but either way it is a valid queue.
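
To make this concrete, here is a minimal sketch in Python; the Pipe
class and the names used are purely illustrative, not part of the
pseudocode.  The replacement queue is built in full before being
installed, so the critical update is a single reference assignment:

    import threading

    class Pipe:
        # Illustrative counterpart of the pseudocode pipe.
        def __init__(self):
            self.mutex = threading.Lock()
            self.queue = []                       # always a complete, valid list

    def pipe_write(p, datum):
        p.mutex.acquire()
        old_queue = p.queue
        new_queue = old_queue + [datum]           # build the replacement in full...
        p.queue = new_queue                       # ...then install it in one step
        p.mutex.release()

If this pipe_write() is interrupted before the assignment, the old queue
remains in place; after it, the new queue is in place; either way the
shared state is a valid list.  Releasing the lock on such an
interruption is the part addressed next.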

The second part of cleanup, releasing the lock, is also simple, provided
that the language gives us control over unwinding:

    function pipe_write(p: pipe, datum: object) {
        try {
            mutex_lock(p.mutex);
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
        } and on unwinding do {
            if p.mutex.owner == current_process, then {
                mutex_unlock(p.mutex);
            }
        }
    }

The combined operation of acquiring a lock and preparing to release it
on unwinding (whether exceptional or not) is pre-packaged in some
languages.  We can rewrite pipe_write() more concisely using such a
feature:

    function pipe_write(p: pipe, datum: object) {
        with mutex p.mutex, do {
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
        }
    }

So far so good.  But we have not exhausted the possible places where an
exception could strike.  Consider a mutex locked by process A, with
processes B and C waiting to lock it.  A is executing its critical
section, while B and C are suspended inside mutex_lock().  A completes
its critical section and unlocks the mutex, waking B.  Before B actually
runs, however, it receives an exception.  Now when B actually runs it
will drop out of mutex_lock() without attempting to lock the mutex.  We
are left with the mutex unlocked, but C waiting for it!  The only way to
unwedge C is for another process to lock and then unlock the mutex.
This is a bug.

The bug actually occurred some way back.  When A executed
mutex_unlock(), it set m.owner to NULL, and removed B from m.waiters,
leaving C.  It left the mutex marked as unlocked (m.owner == NULL) but
with a non-empty wait set.  That violates our definition of the mutex,
which says that an unlocked mutex must have no other state.  B
abandoning its efforts at this point was merely the icing on the cake,
revealing the bug by leaving the mutex in its illegal state long enough
for us to notice.  The code only appeared to work before because, when
putting the mutex into this state, it also woke up a process that
intended to rectify it.

There are three solutions to this bug.  The first is to explicitly allow
a mutex to have this third type of state, where it has no owner, one or
more waiters, and a live process intending imminently to own it.  In
that case the woken waiter has a responsibility to fulfil if it receives
an exception:

    function mutex_lock(m: mutex) {
        need_to_wake := FALSE;
        try {
            repeat indefinitely [[
                if m.owner == NULL, then {
                    m.owner := current_process;
                    need_to_wake := FALSE;
                    return;
                } else {
                    m.waiters := m.waiters + current_process;
                    need_to_wake := TRUE;
                    suspend current_process;  // resume at start of loop
                }
            ]]
        } and on unwinding do [[
            if need_to_wake and m.waiters != EMPTY_SET, then {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                wake p;
            }
        ]]
    }

Thus if the woken prospective locker is interrupted, it hands off its
responsibility to another prospective locker.

The second solution is for the unlocker to wake *all* waiters, rather
than just one.  This can be viewed as a variation on the first solution,
with woken waiters not having to hand off their locking responsibility
because all possible recipients of that responsibility are already
executing it.  From another point of view, though, it solves the problem
by removing the need for the third type of mutex state: the mutex is
never unlocked with suspended waiters.  It thus satisfies both analyses,
but in practice satisfies no one, because it leads to a thundering herd.
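
For comparison, here is a toy sketch of the wake-everyone strategy in
Python.  The WakeAllMutex class is invented for illustration, and a
threading.Condition is used only to obtain suspend and wake; the point
illustrated is solely that unlock() never leaves the mutex unlocked
while waiters stay asleep, at the cost of waking every waiter:

    import threading

    class WakeAllMutex:
        # Toy mutex whose unlock wakes every waiter (the second solution).
        def __init__(self):
            self._cv = threading.Condition()
            self._owner = None

        def lock(self):
            with self._cv:
                while self._owner is not None:
                    self._cv.wait()                  # suspended waiter
                self._owner = threading.current_thread()

        def unlock(self):
            with self._cv:
                if self._owner is not threading.current_thread():
                    raise RuntimeError("not the owner")
                self._owner = None
                self._cv.notify_all()                # wake *all* waiters; they race to relock

A waiter that is woken and then interrupted has nothing to hand off,
because every other waiter has been woken too and will retry the lock on
its own; the cost is that all of them wake and contend.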

The third approach is to stick with the original two-state-type
definition, and modify the unlocker to maintain the invariant properly:

    function mutex_lock(m: mutex) {
        [[
            if m.owner == NULL, then {
                m.owner := current_process;
            } else {
                m.waiters := m.waiters + current_process;
                suspend current_process;  // resume by returning
            }
        ]]
    }

    function mutex_unlock(m: mutex) {
        [[
            if m.owner != current_process, then { error; }
            if m.waiters == EMPTY_SET, then {
                m.owner := NULL;
            } else {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                m.owner := p;
                wake p;
            }
        ]]
    }

Here the mutex ownership is handed off directly to a waiter.  If the
woken waiter gets an exception immediately, it now owns the mutex and so
knows to release it.  The mutex invariant is maintained, and (as a
bonus) a process Z, which never holds the mutex but occasionally looks
at its state, can never see it apparently idle (unlocked) when there are
actually processes waiting for it.

(An asynchronous exception might also occur to a waiting process while
it is suspended.  If such an exception is to prematurely terminate the
suspension then it must remove the process from the wait set.  This
detail is the same in all versions of the code above, and causes no
further trouble.)

1. problem with condition variables
===================================

The previous section has shown a problem in the implementation of
mutexes and how to overcome it.  This section will examine a similar
problem in the use of condition variables.

A condition variable is essentially a group of processes waiting for
some condition to occur.  Processes can join and leave the group,
normally leaving when the condition becomes true.  A condition variable
is always used with a mutex; a mutex plus zero or more condition
variables forms the infrastructure of a monitor, as described in
[MONITORS].

There is a function that atomically joins a condition group and unlocks
the mutex.  A process holds the mutex while initially checking for the
condition of interest, and if the condition is false then holding the
mutex prevents it becoming true, so the atomic wait-and-unlock operation
ensures that a process can't miss the condition becoming true.
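
This check-under-the-lock-then-wait pattern is what mainstream thread
libraries package up.  A minimal sketch in Python follows; the Buffer
class and its names are illustrative.  threading.Condition bundles the
mutex with the wait group, and wait() atomically releases the lock and
joins the group, so a notification cannot slip between the check and the
wait.  (Unlike the condition_wait_and_unlock() defined below, wait()
also relocks before returning; that combined operation is discussed
further on.)

    import threading

    class Buffer:
        def __init__(self):
            self.lock = threading.Lock()
            self.nonempty = threading.Condition(self.lock)   # condition tied to the mutex
            self.items = []

        def get(self):
            with self.nonempty:              # lock the mutex
                while not self.items:        # check the condition under the lock
                    self.nonempty.wait()     # atomically unlock and join the wait group
                return self.items.pop(0)

        def put(self, datum):
            with self.nonempty:
                self.items.append(datum)
                self.nonempty.notify()       # wake one waiter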

Here's a specific implementation, assuming the direct-handoff version of
the mutex code from the previous section:

    struct condition {
        waiters: a set of processes;
    }

    function make_condition() {
        c := new condition;
        c.waiters := EMPTY_SET;
        return c;
    }

    function condition_wait_and_unlock(c: condition, m: mutex) {
        [[
            if m.owner != current_process, then { error; }
            c.waiters := c.waiters + current_process;
            if m.waiters == EMPTY_SET, then {
                m.owner := NULL;
            } else {
                p := any process in m.waiters;
                m.waiters := m.waiters - p;
                m.owner := p;
                wake p;
            }
            suspend current_process;  // resume by returning
        ]]
    }

    function condition_notify_one(c: condition) {
        [[
            if c.waiters != EMPTY_SET, then {
                p := any process in c.waiters;
                c.waiters := c.waiters - p;
                wake p;
            }
        ]]
    }

A typical use then looks like this:

    struct pipe {
        mutex: a mutex;
        nonempty: a condition;
        queue: a list of objects;
    }

    function pipe_write(p: pipe, datum: object) {
        with mutex p.mutex, do {
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
            if old_queue == EMPTY_LIST, then {
                condition_notify_one(p.nonempty);
            }
        }
    }

    function pipe_read(p: pipe) {
        try {
            repeat indefinitely {
                mutex_lock(p.mutex);
                if p.queue != EMPTY_LIST, then {
                    exit loop;
                } else {
                    condition_wait_and_unlock(p.nonempty, p.mutex);
                }
            }
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            if new_queue != EMPTY_LIST, then {
                condition_notify_one(p.nonempty);
            }
            return datum;
        } and on unwinding do {
            if p.mutex.owner == current_process, then {
                mutex_unlock(p.mutex);
            }
        }
    }

Again, consider asynchronous exceptions.  The interesting case is where
processes B and C are waiting, in the condition variable, for the pipe
to be non-empty so that they can read it, and then A writes to the pipe
making it non-empty.  A signals the "nonempty" condition, waking B and
leaving C in the condition's wait set.  B then receives an exception.
B's unwinding will release the mutex if it got it, but is likely to
leave the pipe non-empty with C still suspended waiting for
non-emptiness.

This situation is precisely analogous to the situation with just the
mutex.  The first part of pipe_read() is just like the initial faulty
implementation of mutex_lock().  The problem with the mutex can be
analysed as a simple case of this problem with condition variables.

Some of the same solutions are available.  The first type of solution,
requiring the interrupted process to clean up, is more complicated here.
pipe_read() must be modified to call condition_notify_one() if it is
interrupted any time between calling condition_wait_and_unlock() and
normal termination.  It may or may not hold the mutex when it does this;
the latter may be a problem for some implementations of conditions,
because access to condition variables is usually controlled by the
mutex.  There will also be spurious wakes under this solution, a minor
annoyance.

The second approach, of waking all waiters rather than just one, is
easily applied.  Implementations of condition variables invariably
supply a function that wakes all waiters:

    function condition_broadcast(c: condition) {
        [[
            old_waiters := c.waiters;
            c.waiters := EMPTY_SET;
            wake every process in old_waiters;
        ]]
    }

Calling condition_broadcast() wherever pipe_write() and pipe_read() call
condition_notify_one() solves the problem, but only by trampling it
under a thundering herd.

The third and most satisfactory technique, direct handoff, turns out
very different in this case, and is explored in the next section.
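
To make the first type of solution concrete, here is a rough sketch in
Python; the Pipe class and its fields are illustrative.  Python's
Condition.wait() reacquires the lock even when it raises, so the handler
below always holds the mutex when it re-signals, sidestepping the
complication noted above.  Whether an asynchronous exception can
actually reach a waiting thread depends on the environment, so treat
this only as a sketch of the cleanup obligation:

    import threading
    from collections import deque

    class Pipe:
        def __init__(self):
            self.lock = threading.Lock()
            self.nonempty = threading.Condition(self.lock)
            self.queue = deque()

        def read(self):
            with self.nonempty:
                try:
                    while not self.queue:
                        self.nonempty.wait()
                    datum = self.queue.popleft()
                    if self.queue:
                        self.nonempty.notify()   # queue still non-empty: pass the wakeup on
                    return datum
                except BaseException:
                    # Interrupted somewhere between being woken and finishing:
                    # hand the wakeup on, so another reader is not left asleep
                    # while the pipe is non-empty.
                    if self.queue:
                        self.nonempty.notify()
                    raise

        def write(self, datum):
            with self.nonempty:
                self.queue.append(datum)
                self.nonempty.notify()

The re-notification in the exception path can produce spurious wakes,
the minor annoyance mentioned above, but it never leaves a waiter asleep
while the pipe is non-empty.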

Note that the condition_wait_and_unlock() function presented here does
not automatically relock the mutex after the process is woken.  More
usually the relocking is packaged up as part of the wait operation,
[SRFI-18] being a notable exception.  Such combined operations make the
code more concise, and may (depending on implementation) resolve the
case where the woken process is interrupted before regaining the mutex.
They do not solve the general problem, but we will henceforth use this
approach.  condition_wait_and_relock() shall combine
condition_wait_and_unlock() with mutex_lock(), and the naive pipe_read()
now looks like this:

    function pipe_read(p: pipe) {
        with mutex p.mutex, do {
            while p.queue == EMPTY_LIST, do {
                condition_wait_and_relock(p.nonempty, p.mutex);
            }
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            if new_queue != EMPTY_LIST, then {
                condition_notify_one(p.nonempty);
            }
            return datum;
        }
    }

2. satisfactory solution
========================

Observe that if pipe_read() is required to clean up the monitor state
itself then it usually has to signal the condition.  It must signal the
condition if it is interrupted while it does not have the mutex; also if
it is interrupted before completing its work; and also if the condition
remains true after its work is complete.  The only situation in which it
does not signal the condition is if it got the mutex, completed its
work, and the condition became false as a result.

This suggests that signalling the condition should be the default
behaviour, rather than explicitly invoked.  Ignoring implementation for
a moment, we want to write in this style:

    function pipe_read(p: pipe) {
        with mutex p.mutex, do {
            while p.queue == EMPTY_LIST, do {
                condition_wait_and_relock(p.nonempty, p.mutex);
            }
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            if new_queue == EMPTY_LIST, then {
                not going to signal p.nonempty;
            }
            return datum;
        }
    }

In this version of the code, condition_notify_one(p.nonempty) is to be
called implicitly when the mutex is unlocked, unless the "not going
to ..." statement was reached.  condition_notify_one() is never called
explicitly.  To implement this, the mutex would have to know about the
condition variable, and also maintain a record of whether the condition
is to be signalled when the mutex is unlocked.  The former it can pick
up in the condition_wait_and_relock() call, and the latter is trivial
given the "not going to ..." statement.

But wait.  What if there's more than one condition variable?  In that
case we might need to signal more than one condition on unwinding, and
the correct default behaviour (of whether to signal or not) might vary
between conditions and depend on other state.  When locking the mutex
we'd have to say which conditions to implicitly signal on unlock, and
then modify that list as we mutate the state.  But we can't get this
list completely right upon locking, because in the general case it might
depend on state that we can't see without holding the lock.  It would be
necessary to list every possible condition that the function is
concerned with and then modify the list once the lock is held.

Alternatively, the mutex lock operation could implicitly arrange for
*every* condition attached to the mutex to be signalled when the mutex
is unlocked.  The conditions then wouldn't need to be explicitly listed
in the locking function call.

However, functions would then have to turn off implicit signalling of
any conditions they're not concerned with at all.  This forces them to
know about other parts of the monitor that they otherwise could have
ignored.  This is most unsatisfactory.

The clean solution is that the default signalling behaviour for each
condition is *whatever it was last time*.  More precisely, each
condition variable has an attached Boolean flag indicating whether its
condition is presently true; this flag is maintained by the code that
alters the state that the condition is concerned with.  A mutex must
know about all its associated conditions, and whenever the mutex is
unlocked it calls condition_notify_one() on each condition whose flag is
true.  More subtly, it can examine each condition's queue, and wake just
a single process that is waiting on *any* true condition.  Waking a
waiting process also hands mutex ownership to it directly, rather than
requiring it to regain the mutex itself.  In this system, if a woken
process achieves nothing before being interrupted, its unlocking of the
mutex will cause the next waiting process to be woken using the same
criteria that were used to wake it.

[MONITORS], while advocating a direct handoff approach, explicitly
points out that its condition variables have no Boolean state of the
type developed here: "There is a great temptation to introduce a more
complex synchronization primitive ...  We shall resist this temptation
for a while."  Thirty-one years of resistance seems sufficient.

3. generalisations
==================

The development in the course of the preceding section has led to a
slightly more complex notion of a monitor than was given in [MONITORS].
As before, a monitor consists of a mutex plus zero or more condition
variables.  The monitor knows about all of its condition variables,
which was not required before.  Each condition variable now consists of
a set of waiting processes and a Boolean flag.  The flag in each
condition variable indicates whether its waiting processes are eligible
to be woken.  It is meaningful to add a condition variable to an
existing monitor, or to drop an existing one.  This was always possible
with separate condition variables, but with the new type of monitor the
monitor needs to be told that it is being done.

The mutex itself has a set of waiting processes, and processes in that
set are always eligible to be woken (to be granted the lock).  We can
generalise by detaching the wait set from the mutex, and instead
treating it as an always-true condition variable.  The only remaining
independent part of the mutex is the ownership record.  A process
entering its monitored section while the monitor is locked might start
by entering this always-true condition variable, as it previously
started by entering the mutex's wait set.  But it doesn't have to: it
can now equally well start with a non-trivial condition variable,
entering the condition's wait set directly if the condition is false or
the monitor locked.  The pipe example would be more concise this way.

We can now view a plain mutex as a degenerate case of the new type of
monitor: it has exactly one condition variable, with a condition that is
always true.

It's not necessary to have a condition variable with an always-true
condition.  The limited-length pipe example in [SRFI-18], implemented
there with two condition variables (for the conditions "pipe not empty"
and "pipe not full"), works as a new-style monitor with just those two
condition variables.
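
As a rough illustration of how such a monitor might look in practice,
here is a sketch in Python.  The FlaggedMonitor and FlaggedCondition
classes are invented for this sketch: each condition carries the Boolean
flag described above, maintained by the state-changing code, and leaving
the monitor implicitly notifies a waiter of each condition whose flag is
true.  threading.Condition is used only for the underlying suspend and
wake, and ownership is not handed off directly, so this is a
simplification of the scheme rather than a faithful rendering of the
pseudocode in section 4:

    import threading

    class FlaggedCondition:
        # A condition variable with an attached "currently true" flag.
        def __init__(self, lock):
            self.cv = threading.Condition(lock)
            self.flag = False

    class FlaggedMonitor:
        # A monitor that knows its conditions; unlocking implicitly signals
        # every condition whose flag is true.
        def __init__(self):
            self.lock = threading.Lock()
            self.conditions = []

        def new_condition(self):
            c = FlaggedCondition(self.lock)
            self.conditions.append(c)
            return c

        def __enter__(self):
            self.lock.acquire()
            return self

        def __exit__(self, *exc_info):
            for c in self.conditions:
                if c.flag:
                    c.cv.notify()        # implicit wake of one waiter per true condition
            self.lock.release()

        def wait_for(self, c):
            # Caller must hold the monitor; returns with c.flag true.
            while not c.flag:
                c.cv.wait()

    class BoundedPipe:
        # Modelled on the two-condition limited-length pipe just described.
        MAX = 8

        def __init__(self):
            self.m = FlaggedMonitor()
            self.nonempty = self.m.new_condition()   # flag: queue is not empty
            self.nonfull = self.m.new_condition()    # flag: queue is not full
            self.nonfull.flag = True
            self.queue = []

        def write(self, datum):
            with self.m:
                self.m.wait_for(self.nonfull)
                self.queue.append(datum)
                self.nonempty.flag = True
                self.nonfull.flag = len(self.queue) < self.MAX
            # leaving the with-block signals whichever conditions are flagged true

        def read(self):
            with self.m:
                self.m.wait_for(self.nonempty)
                datum = self.queue.pop(0)
                self.nonfull.flag = True
                self.nonempty.flag = bool(self.queue)
                return datum

If a woken process is interrupted before it accomplishes anything, its
exit from the with-block re-notifies any condition that is still flagged
true, which is the async-safety property developed above.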

It is impossible for both of those conditions to be false
simultaneously, so there is always some way to make progress.
Hypothetically, if all conditions in a monitor were to become false
simultaneously and the lock were then released, it would become
impossible to enter a monitored section, unless a condition can be set
to true without holding the lock.

A process might want to wait for two or more known conditions to be true
simultaneously.  This can be trivially implemented by adding another
condition variable for the conjunction of the other conditions.  More
interestingly, the conjunction can be synthesised.  When the process
wishes to wait for several conditions simultaneously, it first waits for
any one of the conditions.  Now holding the lock, it checks all the
remaining conditions.  If all are true, it has succeeded.  If any are
false, then it waits for any one of the false conditions, and then
repeats the check.

There is a variation that avoids the need for code to manually maintain
the Boolean flags on the condition variables.  Instead of having a flag,
each condition variable has a thunk (a piece of code) which can be
called to determine whether the condition is currently true.  Upon mutex
unlock, rather than examining the Boolean flags, the thunks are called
to determine which wait sets are eligible to be woken.  The thunk for a
condition is also checked when a process wants to enter a monitored
section with that condition.  This variant doesn't cope with conditions
being changed by outside code while the lock is not held, but that could
be rectified by adding a procedure to recheck conditions.

Another variation in representing conditions is to merge all the
condition variables into a single wait set.  This time, each process in
the wait set has an associated thunk, just as each condition variable
had in the previous variation.  Note that the monitor's data structure
now resembles the simple mutex that we started with, except for the
addition of the thunks.  Each process's thunk checks for whatever
condition the process requires in order for it to be woken.  When the
monitor is unlocked, the unlock code calls the thunks of the waiting
processes, and grants the lock to a process whose thunk returns true.
This is extremely flexible, because wake conditions don't have to be
known in advance, but is needlessly expensive if many waiting processes
have the same wake condition.

Having all the monitor's wait queues accessible to the lock code also
helps to solve a priority inversion problem that can occur with
traditional condition variables.  Suppose process A, high priority, is
waiting on a condition, and process C, low priority, holds the
associated lock.  In a traditional monitor it may be impossible to
detect the dependency, because A isn't in the mutex's wait set, and so a
medium-priority process B could preempt C, and so indirectly A.  In the
type of monitor described in this paper, A is in a wait set very clearly
connected with C's lock, so priority inheritance may be applied.  This
doesn't solve the case where the condition is false and the monitor
unlocked, however.

4. pseudocode implementation
============================

This section contains a pseudocode implementation of the most general
form of monitor described in the previous section.  This form has the
disadvantage of requiring arbitrary predicates to be evaluated in atomic
sections, making it impractical to implement in many contexts.
Nevertheless, it is conceptually the cleanest, and forms a useful
theoretical model when considering the more restricted forms.

    struct waiter {
        process: a process;
        condition: a thunk;
    }

    struct conditional_mutex {
        owner: a process or NULL;
        waiters: a set of waiters;
    }

    function make_conditional_mutex() {
        cm := new conditional_mutex;
        cm.owner := NULL;
        cm.waiters := EMPTY_SET;
        return cm;
    }

    function conditional_mutex_lock(cm: conditional_mutex, cond: thunk) {
        [[
            if cm.owner == NULL and cond(), then {
                cm.owner := current_process;
            } else {
                wtr := new waiter;
                wtr.process := current_process;
                wtr.condition := cond;
                cm.waiters := cm.waiters + wtr;
                suspend current_process;  // resume by returning
            }
        ]]
    }

    function conditional_mutex_unlock(cm: conditional_mutex) {
        [[
            if cm.owner != current_process, then { error; }
            for each wtr in cm.waiters, do {
                if wtr.condition(), then {
                    cm.waiters := cm.waiters - wtr;
                    cm.owner := wtr.process;
                    wake wtr.process;
                    discard wtr;
                    return;  // ignore all remaining waiters
                }
            }
            // no runnable waiters
            cm.owner := NULL;
        ]]
    }

    function conditional_mutex_relock(cm: conditional_mutex, cond: thunk) {
        [[
            if cm.owner != current_process, then { error; }
            if not cond(), then {
                wtr := new waiter;
                wtr.process := current_process;
                wtr.condition := cond;
                cm.waiters := cm.waiters + wtr;
                conditional_mutex_unlock(cm);
                suspend current_process;  // resume by returning
            }
        ]]
    }

    // to call when a condition might have become true asynchronously
    function conditional_mutex_recheck(cm: conditional_mutex) {
        [[
            if cm.owner == NULL, then {
                for each wtr in cm.waiters, do {
                    if wtr.condition(), then {
                        cm.waiters := cm.waiters - wtr;
                        cm.owner := wtr.process;
                        wake wtr.process;
                        discard wtr;
                        return;  // ignore all remaining waiters
                    }
                }
            }
        ]]
    }

Because automatically releasing a mutex on unwinding is such a common
requirement, the shorthand syntax

    with conditional_mutex <cm> when <cond>, do {
        <body>
    }

shall stand for the full form

    try {
        conditional_mutex_lock(<cm>, { <cond> });
        <body>
    } and on unwinding do {
        if <cm>.owner == current_process, then {
            conditional_mutex_unlock(<cm>);
        }
    }

For demonstration purposes, here is an async-safe implementation of a
limited-length pipe object:

    function pipe_read(p: pipe) {
        with conditional_mutex p.condmutex when p.queue != EMPTY_LIST, do {
            old_queue := p.queue;
            datum := first_item_from(old_queue);
            new_queue := all_but_first_item_from(old_queue);
            p.queue := new_queue;
            return datum;
        }
    }

    function pipe_write(p: pipe, datum: object) {
        with conditional_mutex p.condmutex when length(p.queue) != MAX_PIPE_LENGTH, do {
            old_queue := p.queue;
            p.queue := append(old_queue, datum);
        }
    }

5. references
=============

[MONITORS] C. A. R. Hoare, "Monitors: An Operating System Structuring
Concept", Communications of the ACM, Vol. 17, No. 10, October 1974,
pp. 549-557.

[SRFI-18] Marc Feeley, "SRFI 18: Multithreading support", 2001-03-14.