Concurrency and Exceptions

by David Le Bansais (modified: 2011 May 06)

Exceptions in the context of concurrent Eiffel programming (i.e. SCOOP) are implemented using a straightforward policy: the program is halted as soon as the exception occurs. Here I show why sometimes a programmer cannot really avoid this default behavior even with a rescue clause. A more flexible mechanism is therefore desirable. I introduce an exception model viable in the context of concurrent processors, and respectful of the exception model already in use in single-threaded programs.

In single-threaded Eiffel programs, a routine that receives an exception can usually handle it by restoring the state the object had when the routine was entered, and setting a flag to indicate that the object is now in a state that requires some cleanup. Here is a quick example of this strategy:

class
    CALCULATOR

create
    make

feature {NONE} -- Initialization

    make
        do
            last_value := 0.0
            has_crashed := False
        end

feature {NONE}

    last_value: DOUBLE

feature

    has_crashed: BOOLEAN

    current_value: DOUBLE
        require
            not has_crashed
        do
            Result := last_value
        end

feature

    divide (fp: DOUBLE)
        require
            not has_crashed
        local
            restore_value: DOUBLE
        do
            if not has_crashed then
                restore_value := last_value
                last_value := last_value / fp
            end
        ensure
            not has_crashed implies last_value = old last_value / fp
        rescue
            has_crashed := True
            last_value := restore_value
            retry -- Optional, see below
        end

end

The rescue clause can either include a retry instruction or not. With a retry, the exception handling is completely safe: it becomes the responsibility of the caller to check whether the CALCULATOR object is in a good state or has crashed. This information is available at any time.

Without a retry, the exception is escalated to the caller, but the object can still be queried to check whether it has crashed or not. Letting the caller know about the exception is just a design preference.

In a concurrent context, the retry instruction is no longer optional: without it, the entire program halts. The consequence is that all objects that are expected to generate exceptions at some point must be designed more or less as shown in this quick example.

There are, however, some situations where it's not that easy:

  • In creation procedures, there is no state to restore. Therefore, if the object creation depends on some required parameters, it might be impossible to end the creation with a valid object. This situation might look like an academic case at first glance, but there exists at least one example of such a class: FILE. Indeed, trying to open a file that doesn't exist generates an exception, in {FILE}.make_open_read for instance.
  • In queries, there must be a result to return -- this is of course even more necessary in void-safe programs. However, sometimes there is no good return value that can be used to indicate an error. Consider this quick and dirty implementation of divide:

divide (r1, r2: DOUBLE): DOUBLE
    do
        Result := r1 / r2
    end
For a discussion on why NaN isn't a good solution, see http://bertrandmeyer.com/2010/02/06/reflexivity-and-other-pillars-of-civilization/ and http://www.jot.fm/issues/issue_2006_03/column8/index.html.

It follows from these considerations that a model where exceptions can be handled by a multi-threaded program is very desirable. Otherwise most of these programs would be plagued with if obj.has_crashed then ... end lines.

There is no simple solution

After having experimented with several of the SCOOP examples, I've come to the conclusion that no simple solution can cover all situations. In this section I will explain why, with the example of a CALCULATOR class. It's a long example, I'm afraid, but it helps in understanding the issue, and I will reuse it in the next section to describe my proposal.

CALCULATOR is used to perform the most basic operations on floating point numbers: add, subtract, multiply and divide. Because there is usually only one co-processor on the computer, in a multi-threaded program each thread of execution must restore its own floating point context before it can perform the operation. This reset has a high performance cost, therefore it's desirable to have the calculator sit on its own (SCOOP) processor (I'll use the SCOOP meaning of this word from now on).

The first implementation of CALCULATOR will not perform this optimization. It's mostly there to demonstrate a client-server relation between two objects on different (SCOOP) processors, where the server (CALCULATOR) is designed to have only one client. After a discussion of the best exception model in this case, I will extend the class to handle multiple clients, with code closer to what a real CALCULATOR class would look like. With a single calculator serving several requests from different processors concurrently, proper floating-point optimization can be obtained.

Single-client calculator

The example begins with the APPLICATION root class, and uses a small USER_IO class to exercise the calculator.

class
    APPLICATION

create
    make

feature {NONE} -- Initialization

    make
        local
            calculator: separate CALCULATOR
            user: separate USER_IO
        do
            create calculator.make
            create user
            calculate_until_exit (calculator, user)
        end

    calculate_until_exit (a_calculator: separate CALCULATOR; a_user: separate USER_IO)
        local
            exit: BOOLEAN
        do
            from
                a_calculator.reset
            until
                exit
            loop
                a_user.display_value (a_calculator.current_value)
                inspect
                    a_user.next_operation
                when '+' then
                    a_calculator.add (a_user.value)
                when '-' then
                    a_calculator.subtract (a_user.value)
                when '*' then
                    a_calculator.multiply (a_user.value)
                when '/' then
                    a_calculator.divide (a_user.value)
                else
                    exit := True
                end
            end
        rescue
            a_user.display_invalid
            retry
        end

end
class
    USER_IO

feature

    value: DOUBLE

    next_operation: CHARACTER
        local
            operation: CHARACTER
        do
            io.read_character
            operation := io.last_character
            inspect
                operation
            when '+', '-', '*', '/' then
                io.read_double
                value := io.last_double
            else
            end
            Result := operation
        end

    display_value (fp: DOUBLE)
        do
            io.new_line
            io.put_string ("=> ")
            io.put_double (fp)
            io.put_string (" ")
        end

    display_invalid
        do
            io.new_line
            io.put_string ("invalid")
        end

end
class
    CALCULATOR

create
    make

feature {NONE} -- Initialization

    make
        do
            last_value := 0.0
            has_crashed := False
        end

feature {NONE}

    last_value: DOUBLE

feature

    has_crashed: BOOLEAN

    current_value: DOUBLE
        require
            not has_crashed
        do
            Result := last_value
        end

feature

    reset
        do
            last_value := 0.0
            has_crashed := False
        ensure
            last_value = 0
            not has_crashed
        end

feature

    add (fp: DOUBLE)
        require
            not has_crashed
        do
            last_value := last_value + fp
        ensure
            last_value = old last_value + fp
        rescue
            has_crashed := True
        end

    subtract (fp: DOUBLE)
        require
            not has_crashed
        do
            last_value := last_value - fp
        ensure
            last_value = old last_value - fp
        rescue
            has_crashed := True
        end

    multiply (fp: DOUBLE)
        require
            not has_crashed
        do
            last_value := last_value * fp
        ensure
            last_value = old last_value * fp
        rescue
            has_crashed := True
        end

    divide (fp: DOUBLE)
        require
            not has_crashed
        do
            last_value := last_value / fp
        ensure
            last_value = old last_value / fp
        rescue
            has_crashed := True
        end

end

I will concentrate on the discussion of exception handling when a divide by zero exception occurs.

As you can see, {CALCULATOR}.divide doesn't have a retry instruction. Therefore, if you run this program and input a divisor of zero, you'll get a crash. This is in line with the exception model as currently implemented.

My CALCULATOR class is designed expecting a different model, much closer to the sequential Eiffel model:

  • If an exception occurs in routine r, then at the end of r (if there were any retries, at the end of the execution in which the rescue code didn't retry), the object is flagged 'dirty'. Note that CALCULATOR implements its own has_crashed attribute as well, but this has nothing to do with the 'dirty' flag.
  • Commands applied to a dirty object are discarded. Hence, if CALCULATOR is called with divide(0) followed by add(1), the invocation of add is ignored and not queued on the processor. Its preconditions, as well as the CALCULATOR invariant (if there were one), aren't checked either.
  • Queries applied to a dirty object are discarded as well; however, the first such query clears the 'dirty' flag. If CALCULATOR is called with divide(0) followed by current_value, then current_value again, the second invocation of current_value will be processed, and generate a precondition violation exception since has_crashed is still true.
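These rules can be sketched in ordinary Python (a simulation, not Eiffel/SCOOP; the proxy class and all names here are invented for illustration, with the proxy standing in for the processor boundary):

```python
class DirtyObjectProxy:
    """Applies the single-client exception rules to a wrapped server object."""

    def __init__(self, target):
        self.target = target
        self.dirty = False

    def command(self, name, *args):
        # Commands applied to a dirty object are discarded.
        if self.dirty:
            return
        try:
            getattr(self.target, name)(*args)
        except Exception:
            self.dirty = True  # routine ended without a successful retry

    def query(self, name):
        # The first query on a dirty object clears the flag, is
        # discarded, and raises an exception in the caller.
        if self.dirty:
            self.dirty = False
            raise RuntimeError("server object was dirty")
        return getattr(self.target, name)()


class Calculator:
    """Stand-in for the Eiffel CALCULATOR class."""

    def __init__(self):
        self.last_value = 0.0
        self.has_crashed = False

    def reset(self):
        self.last_value = 0.0
        self.has_crashed = False

    def divide(self, fp):
        try:
            self.last_value /= fp
        except ZeroDivisionError:
            self.has_crashed = True
            raise

    def current_value(self):
        assert not self.has_crashed  # precondition
        return self.last_value


proxy = DirtyObjectProxy(Calculator())
proxy.command("divide", 0.0)         # exception: the object becomes dirty
proxy.command("divide", 2.0)         # discarded: the object is dirty
try:
    proxy.query("current_value")     # first query: clears the flag, raises
except RuntimeError:
    proxy.command("reset")           # rescue: clean the object up
print(proxy.query("current_value"))  # prints 0.0
```

A real implementation would live in the SCOOP runtime, of course; the point of the sketch is only that the caller's rescue clause sees one exception and then resumes normally.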

Now that I have stated the new rules of the game, we can follow the flow of execution between processors:

  1. Assume {CALCULATOR}.divide(0) is called. Upon return, the calculator object is marked as dirty.
  2. During this time, we've been looping in the main processor's object (in {APPLICATION}.calculate_until_exit) and eventually {CALCULATOR}.current_value is called.
  3. This is a query, so according to the rules the calculator object's dirty flag is cleared, the query is ignored and the caller receives an exception.
  4. Remember that this exception happens on the main processor in {APPLICATION}.calculate_until_exit, so the rescue clause of that routine is executed. First it tells the user about the exception (it displays "invalid" on the console), then resumes {APPLICATION}.calculate_until_exit.
  5. Resuming means the {CALCULATOR}.reset command is applied on the now non-dirty object. In practice, that means the calculator is returned to a reasonable state.
  6. Normal processing can then continue.

With the new exception model, designing the program is very easy. Rescue clauses are simple, with no twisted logic. It looks so much like sequential exception handling that most programmers would probably feel at ease when debugging their code.

A quick note about the {CALCULATOR}.has_crashed attribute: this attribute is necessary because, after the dirty flag has been cleared, only a call to {CALCULATOR}.reset should be allowed. Remember that {CALCULATOR}.last_value doesn't contain anything meaningful (it's probably the result of the previous operation), and therefore no operations should be allowed on the calculator object in this state, until it's reset.

If CALCULATOR used NaN, has_crashed would not be necessary. The code would look like this:

class
    CALCULATOR

create
    make

feature {NONE} -- Initialization

    make
        do
            last_value := 0.0
        end

feature {NONE}

    last_value: DOUBLE

feature

    current_value: DOUBLE
        require
            last_value /= {DOUBLE}.NaN
        do
            Result := last_value
        end

feature

    reset
        do
            last_value := 0.0
        ensure
            last_value = 0
        end

feature

    ...

    divide (fp: DOUBLE)
        require
            last_value /= {DOUBLE}.NaN
        do
            last_value := last_value / fp
        ensure
            last_value = old last_value / fp
        rescue
            last_value := {DOUBLE}.NaN
        end

end

We will avoid NaN; has_crashed makes more sense. NaN is for languages that don't have exception handling (not that they deserve such contempt; SCOOP exception handling isn't exactly gold either...).

Multi-client calculator

So far our calculator hasn't been much of an improvement performance-wise. If two clients need a calculator, each of them must have its own copy, so every time one of the calculators is invoked it must reset the co-processor before it can perform the request, if only to clear a divide-by-zero exception flag (on the co-processor) left over from the other calculator.

A multi-client version is in order. Here is the code, with APPLICATION and USER_IO identical to the single-client version.

class
    CALCULATOR

create
    make

feature {NONE} -- Initialization

    make
        do
            has_crashed := False
        end

feature

    has_crashed: BOOLEAN

    current_value: DOUBLE
        require
            not has_crashed
        do
            Result := accelerator_last_value (accelerator, register (accelerator))
        end

    accelerator_last_value (an_accelerator: separate HARDWARE_CALCULATOR; a_register: INTEGER): DOUBLE
        do
            Result := an_accelerator.last_register_value (a_register)
        end

feature

    reset
        do
            reset_accelerator (accelerator, register (accelerator))
            has_crashed := False
        ensure
            not has_crashed
        end

    reset_accelerator (an_accelerator: separate HARDWARE_CALCULATOR; a_register: INTEGER)
        do
            an_accelerator.reset (a_register)
        end

feature

    add (n: DOUBLE)
        require
            not has_crashed
        do
            add_using_accelerator (accelerator, register (accelerator), n)
        rescue
            has_crashed := True
        end

    subtract (n: DOUBLE)
        require
            not has_crashed
        do
            subtract_using_accelerator (accelerator, register (accelerator), n)
        rescue
            has_crashed := True
        end

    multiply (n: DOUBLE)
        require
            not has_crashed
        do
            multiply_using_accelerator (accelerator, register (accelerator), n)
        rescue
            has_crashed := True
        end

    divide (n: DOUBLE)
        require
            not has_crashed
        do
            divide_using_accelerator (accelerator, register (accelerator), n)
        rescue
            has_crashed := True
        end

feature {NONE} -- Implementation

    accelerator: separate HARDWARE_CALCULATOR
        once ("PROCESS")
            create Result.make
        end

    register (an_accelerator: separate HARDWARE_CALCULATOR): INTEGER
        once ("OBJECT")
            Result := an_accelerator.reserve_register
        end

    add_using_accelerator (an_accelerator: separate HARDWARE_CALCULATOR; a_register: INTEGER; n: DOUBLE)
        do
            an_accelerator.add (a_register, n)
        end

    subtract_using_accelerator (an_accelerator: separate HARDWARE_CALCULATOR; a_register: INTEGER; n: DOUBLE)
        do
            an_accelerator.subtract (a_register, n)
        end

    multiply_using_accelerator (an_accelerator: separate HARDWARE_CALCULATOR; a_register: INTEGER; n: DOUBLE)
        do
            an_accelerator.multiply (a_register, n)
        end

    divide_using_accelerator (an_accelerator: separate HARDWARE_CALCULATOR; a_register: INTEGER; n: DOUBLE)
        do
            an_accelerator.divide (a_register, n)
        end

end
class
    HARDWARE_CALCULATOR

create
    make

feature {NONE} -- Initialization

    make
        do
            create reserved_registers.make_filled (False, 0, max_registers - 1)
            create registers.make_filled (0.0, 0, max_registers - 1)
        end

    reserved_registers: ARRAY [BOOLEAN]

    registers: ARRAY [DOUBLE]

feature

    reserve_register: INTEGER
        local
            i: INTEGER
        do
            from
                i := 0
            until
                i >= max_registers or else not reserved_registers [i]
            loop
                i := i + 1
            end
            if i < max_registers then
                reserved_registers [i] := True -- Mark the register as taken
                Result := i
            else
                (create {FLOATING_POINT_FAILURE}).raise
            end
        end

    reset (a_register: INTEGER)
        require
            a_register >= 0 and a_register < max_registers
        do
            registers [a_register] := 0
        end

feature

    add (a_register: INTEGER; n: DOUBLE)
        require
            a_register >= 0 and a_register < max_registers
        do
            registers [a_register] := registers [a_register] + n
        ensure
            registers [a_register] = old registers [a_register] + n
        end

    subtract (a_register: INTEGER; n: DOUBLE)
        require
            a_register >= 0 and a_register < max_registers
        do
            registers [a_register] := registers [a_register] - n
        ensure
            registers [a_register] = old registers [a_register] - n
        end

    multiply (a_register: INTEGER; n: DOUBLE)
        require
            a_register >= 0 and a_register < max_registers
        do
            registers [a_register] := registers [a_register] * n
        ensure
            registers [a_register] = old registers [a_register] * n
        end

    divide (a_register: INTEGER; n: DOUBLE)
        require
            a_register >= 0 and a_register < max_registers
        do
            registers [a_register] := registers [a_register] / n
        ensure
            registers [a_register] = old registers [a_register] / n
        end

    last_register_value (a_register: INTEGER): DOUBLE
        require
            a_register >= 0 and a_register < max_registers
        do
            Result := registers [a_register]
        end

feature

    max_registers: INTEGER = 7

end

The HARDWARE_CALCULATOR class is designed to use a feature of the Intel 80487 co-processor: it has 8 floating-point registers; 7 of them can be used to store a value for up to 7 clients, and register 0 is used to hold the second operand of the current operation.

Say ST(0) to ST(7) are the 8 registers, and ST is a shortcut for ST(0); then the following can be executed:

  1. Store the second operand in ST
  2. Execute Fop ST, ST(x), where Fop is the operation (FADD, FDIV, etc.) and x is the register associated with a client, from 1 to 7.
  3. Execute FXCH ST(x) to exchange register 0 and register x, which has the effect of storing the result in register x, since register 0 will be overwritten by the next operation.

The co-processor pipeline is designed in such a way that any FXCH following a Fop takes no cycle. Therefore, the total cost of an operation is the cost of the store plus Fop. This leads to a dramatic increase in performance.

In my example, I haven't used floating-point assembly opcodes, but essentially every line such as

    registers [a_register] := registers [a_register] / n

represents steps 1 to 3.

Before we begin the discussion about what happens during an exception, let me clear some confusion about the code.

  • Once functions have the "PROCESS" and "OBJECT" tags. The once function with the "PROCESS" tag is there to create a single hardware calculator that will be used by all instances of CALCULATOR. The once function with the "OBJECT" tag is there to reserve one of the 7 registers; it's executed once per instance.
  • There is no routine to release a reserved register. Such a routine would be called during a dispose call for an instance of CALCULATOR ready to be collected. I did not want to add too much code to an already big post, so I skipped it. Note, however, that if the hardware calculator is dirty at the time of the dispose call, this would lead to register leaks. Therefore, HARDWARE_CALCULATOR should have is_reserved(a_register) and release(a_register) features, and {CALCULATOR}.dispose should loop until the register is effectively released.
  • There are only 7 registers that can be reserved. Therefore, if an 8th calculator instance is created and tries to perform a floating-point operation, the code will fail when trying to reserve a register. This is the purpose of the call to {FLOATING_POINT_FAILURE}.raise. Because of this, even a simple add can generate an exception. But since the code of add, subtract, multiply and divide is almost identical, I will discuss exceptions in divide only; both the software and hardware exceptions execute the same code.
  • Ensure clauses have been moved from CALCULATOR to HARDWARE_CALCULATOR. This is because last_value is no longer an attribute of CALCULATOR. This raises interesting questions about how to specify a contract when that would force the code to synchronize processors. Are separate preconditions so harmless? Stay tuned...

This time, we use a different exception model, one that better fits multi-client/single-server objects.

  • If an exception occurs in routine r, then at the end of r (if there were any retries, at the end of the execution in which the rescue code didn't retry), the object's dirty flag is set.
  • A difference with the single client model is that, along with the dirty flag set, the identifier of the processor of the faulty caller is saved.
  • Commands applied to a dirty object are treated differently depending on the originating processor: if they don't come from the same processor as the faulty caller, they are kept in the queue, but not executed. Otherwise the first of them is executed, and will then clear the dirty flag if it ends normally (i.e. without a new exception).
  • Queries applied to a dirty object are treated differently as well, depending on the originating processor: similarly to commands, queries that don't come from the same processor as the faulty caller are enqueued. Otherwise, and now contrary to the single-client exception model, queries trigger an exception on the caller, but don't clear the dirty bit.
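The multi-client rules can likewise be sketched in Python (again a simulation with invented names, not the SCOOP runtime itself; a real scheduler would replay the held calls once the flag is cleared, which the sketch merely records):

```python
class MultiClientProxy:
    """The dirty flag is paired with the identifier of the faulty
    caller's processor; other processors' calls are only enqueued."""

    def __init__(self, target):
        self.target = target
        self.dirty = False
        self.faulty_pid = None  # processor whose call raised the exception
        self.held = []          # calls from other processors, kept but not run

    def command(self, pid, name, *args):
        if self.dirty:
            if pid != self.faulty_pid:
                self.held.append((pid, name, args))  # enqueued, not executed
                return
            # First command from the faulty caller: executed, and on
            # normal termination it clears the dirty flag.
            getattr(self.target, name)(*args)
            self.dirty, self.faulty_pid = False, None
            return
        try:
            getattr(self.target, name)(*args)
        except Exception:
            self.dirty, self.faulty_pid = True, pid

    def query(self, pid, name, *args):
        if self.dirty:
            if pid != self.faulty_pid:
                self.held.append((pid, name, args))  # enqueued as well
                return None
            raise RuntimeError("dirty")  # raised in caller; flag NOT cleared
        return getattr(self.target, name)(*args)


class Registers:
    """Stand-in for HARDWARE_CALCULATOR's register bank."""

    def __init__(self):
        self.values = {}

    def reset(self, r):
        self.values[r] = 0.0

    def divide(self, r, n):
        self.values[r] = self.values[r] / n

    def value(self, r):
        return self.values[r]


hw = MultiClientProxy(Registers())
hw.command(1, "reset", 1)
hw.command(2, "reset", 2)
hw.command(1, "divide", 1, 0.0)   # processor 1 causes the exception
hw.command(2, "divide", 2, 3.0)   # processor 2's call: kept in the queue
try:
    hw.query(1, "value", 1)       # faulty caller's query: exception,
except RuntimeError:              # dirty flag stays set
    hw.command(1, "reset", 1)     # cleanup command clears the flag
```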

Can it be made more complicated? Yes! I'll refine the model in the last section. It actually makes it both less and more complicated, in some way. But let's not get ahead of ourselves.

Applying the new exception model to our calculator, the behavior can be summarized in three cases:

  • Commands and queries that come from processors other than the processor that was at the origin of the exception are just queued.
  • Queries coming from the same processor as the faulty caller trigger an exception in the caller.
  • The first command coming from the same processor as the faulty caller will be executed and then clears the dirty flag.

The role of commands and queries has changed. However if there was only one calculator at any given time (like for a single client design), the executed code would be the same. Let's see why:

  • Say a divide by zero exception occurs in {HARDWARE_CALCULATOR}.divide. The hardware calculator is flagged as dirty.
  • The call to {HARDWARE_CALCULATOR}.last_register_value triggers an exception, escalated so that {CALCULATOR}.current_value ends with the calculator flagged as dirty as well.
  • Since {CALCULATOR}.current_value is a query, it generates an exception in {APPLICATION}.calculate_until_exit and the rescue code gets executed.
  • {CALCULATOR}.reset is executed, so is {HARDWARE_CALCULATOR}.reset. At the end, both the hardware calculator and the calculator have the dirty flag reset.
  • Operations resume normally.

This model is good at handling calls coming from multiple processors at a time. It has a few obvious flaws, though.

First of all, if two commands follow each other and the first one triggers an exception, the next one will reset the dirty flag. This is clearly not what we want. Second, if the faulty caller decides to give up and use a new calculator, other processors are stuck with an object that cannot be collected: it's a deadlock. Third, what should be done if the processor that can unlock the dirty object disappears itself? This is another source of deadlock.

About the deadlock objections, it should be noted that if there are no rescue clauses anywhere in the chain of calls and the chain of processor requests, eventually either the root object will report an exception, or a routine without a caller (more precisely, with a caller on a processor that no longer exists) will be found. In both cases, it seems justified to halt the program.

The consequence is that a routine that uses an object designed to be called concurrently should be aware of possible exceptions, and clean said object up. But this is no news to programmers, who already must release resources they acquire.

The first objection correctly points at the fact that the first command after an exception is given a special role it might not be able to perform. In the CALCULATOR example, we know that the exception is immediately followed by a call to reset, but in fact what we want from reset is that it clears the dirty flag, regardless of the circumstances in which it's called.

In other words, we want to give a special role to the reset routine specifically. The rules of the exception model would make more sense if we could say: all commands are ignored except reset, at the end of which the dirty flag is cleared.

There are already features that have a specific meaning in classes: creation procedures. Therefore, perhaps a good idea would be to add reset to the create clause of the class:

class
    CALCULATOR
create
    make, reset
feature
    ...

I realize the purpose of reset is not to initialize the object. However, initialization is not really the role of creation procedures either: their purpose is to ensure the invariant, i.e. to make sure they return leaving the object in a valid state. This is pretty much what reset is required to do.

About processor queues

In the description of the exception model, I mention that some feature calls, those coming from other processors, would be queued on the processor queue, and that a call to the cleanup feature would be processed immediately, to clear the dirty bit.

Usually, executing features out of order is frowned upon. However, in this case we're comparing feature calls that already arrive on the queue in no particular order. While calls from a single processor may be synchronized among themselves, features called from other processors are (theoretically) not synchronized with the cleanup feature. The only calls that are synchronized with it are either ignored (commands) or generate exceptions before even reaching the queue (queries).

A difficulty that has been overlooked so far is that queues are associated with processors, not objects. That is, a dirty object effectively blocks all other objects on the same processor from receiving calls. This could become a problem if these objects are needed for cleanup.

Take a system with processors Px and Py, each holding objects O1x, O2x, ..., O1y, etc. If O1x calls O1y, generates an exception that makes O1y dirty, then queries O2y, this leads to a deadlock: the dirty object cannot be cleaned up until the pending call to O2y returns, and that call won't return until the dirty object is cleaned up. Programmers must be aware of this issue, because there is probably no way around it:

  • Features of non-dirty objects on Py cannot be called; that would mean making calls out of order inside a processor queue.
  • No exception should be generated on a query to O2y: this object is working properly.

This enforces a practice on separate calls that is in a way similar to resource locking: for a given order of commands sent to objects, the queries that collect the results should use the same order. For instance:

some_routine (y1, y2: separate SOME_TYPE)
        -- At run-time, y1 and y2 are on the same processor Py.
    local
        y2_result: INTEGER
    do
        y1.some_command
        y1.another_command
        y2.yet_another_command
        y2_result := y2.some_query -- Wrong: should query y1 first, directly or indirectly
    end

Another issue is that there might be more commands from Px in the queue. For instance, in the example above, if some_command raises an exception, another_command could already be in the queue.

In this case, it is necessary to purge the processor queue of these commands. If the routine that raised the exception is a query, there won't be any other query in the queue; otherwise there might be one. In that case, the query found in the queue is also removed, and since that means the calling processor Px is waiting for the result, we can raise an exception there.

Commands found in the queue that are creation procedures should be discarded as well. They should be used after the exception is reported to the calling processor, not before.

It should also be noted that while commands sent to a processor queue to act on a dirty object will be discarded (unless they are creation procedures), commands sent to the same queue but targeting another, non-dirty object will not. They will remain queued, on hold.
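The purge described above can be sketched as follows (a Python sketch with an invented queue representation of (processor id, routine name, is_query) triples):

```python
def purge(queue, faulty_pid):
    """Remove the faulty caller's pending calls from the processor queue.

    Returns True if a query from that processor was found and removed,
    meaning the caller is blocked waiting and should get an exception.
    """
    kept, query_found = [], False
    for pid, name, is_query in queue:
        if pid != faulty_pid:
            # Calls from other processors (and for other objects) stay queued.
            kept.append((pid, name, is_query))
        elif is_query:
            # At most one query per processor at any time; it is removed
            # too, and the waiting caller will receive an exception.
            query_found = True
        # Commands from faulty_pid, creation procedures included, are dropped.
    queue[:] = kept
    return query_found


queue = [(1, "add", False), (2, "multiply", False),
         (1, "current_value", True), (3, "subtract", False)]
assert purge(queue, 1) is True
assert queue == [(2, "multiply", False), (3, "subtract", False)]
```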

Finally, there is the problem of what to do if the command was just a 'fire and forget' type of call, which means there will be no query to trigger an exception in the calling processor. That processor might even go away. In this situation, other processors calling Py are deadlocked.

I believe that, for this situation, it makes sense that no further call should be applied to the dirty object, since it'll stay dirty forever. The correct design in that case is to have all exceptions handled before they have a chance to reach the processor boundary.

Summary and conclusion

I explained why, in the context of SCOOP, a better handling of exceptions would be desirable. Exceptions reaching the boundary of processors are sometimes unavoidable.

The suggested exception handling model works as follows:

  • Creation routines that don't raise an exception clear the dirty bit of the object they are applied to.
  • Every time a call is applied to an object, the identifier of the processor that issued the call is recorded. Let's call it the 'calling' processor.
  • If an exception occurs in routine r of object O, sitting on processor P:
    • Rescue clauses, retries, and escalation of the exception to a caller on the same processor are performed as usual.
    • If the exception is still not handled when the last routine before the processor boundary returns (i.e. there is no caller on the same processor anymore), then the steps below are performed. We will assume that r applied to O is in fact the last routine in the call stack.
    • If O's dirty flag was already set, the program is halted; otherwise O's dirty flag is set and the calling processor's identifier is stored in the object. The only situation where the dirty flag can already be set is during the execution of a creation procedure; no exception is allowed in that case.
    • A search is performed in the processor queue for other commands and queries coming from the calling processor. Commands are removed from the queue and discarded. The search continues until no more calls are found, or until a query is found (there can be only one query in the queue from a given processor at any time).
    • If r is a query, or if there is a query in the queue, the calling processor is currently waiting for the result: an exception is raised in the waiting routine.
  • A processor holding one or more objects with the dirty bit set will not withdraw scheduled calls from its queue (to execute them) except as specified below.
  • Any attempt to apply a new call on processor P then falls into one of the following categories:
    • The call targets an object whose dirty flag is not set, or the calling processor identifier differs from the stored identifier: the call is queued normally.
    • The object's dirty bit is set and the calling processor identifier matches the stored identifier; then:
      • If the called routine is a query, an exception is raised in the caller, and the query is discarded.
      • Otherwise, if the called routine is not a creation routine, the call is discarded and no further action is taken.
      • Otherwise, the called routine is scheduled as the first routine to execute in the processor queue, and the processor can begin executing calls again. Note that the dirty bit is not cleared yet.
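The dispatch categories above can be condensed into a small decision function (a Python sketch with invented names; 'creation routine' covers the routines listed in the create clause, such as reset):

```python
def dispatch(obj_dirty, stored_pid, calling_pid, is_query, is_creation):
    """Decide the fate of a new call arriving on processor P."""
    if not obj_dirty or calling_pid != stored_pid:
        return "enqueue"          # queued normally
    if is_query:
        return "raise-in-caller"  # query discarded, caller gets an exception
    if is_creation:
        return "run-first"        # scheduled ahead of the queue;
                                  # the dirty bit is not cleared yet
    return "discard"              # ordinary command: dropped silently


# Clean object, or a different processor than the stored one: queued normally.
assert dispatch(False, None, 1, False, False) == "enqueue"
assert dispatch(True, 1, 2, True, False) == "enqueue"
# Dirty object, same processor as the faulty caller:
assert dispatch(True, 1, 1, True, False) == "raise-in-caller"
assert dispatch(True, 1, 1, False, False) == "discard"
assert dispatch(True, 1, 1, False, True) == "run-first"
```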

I wish it could be simpler. I also wish I could try to implement it, to see if there is anything I forgot (since it's most likely the case).

I believe that with this mechanism, the code dedicated to exception handling in programs can be kept small, and should cover most programmers' needs. It's also compatible with the current implementation, since in the absence of rescue clauses the program is aborted as it is today.

It does add some requirements on the code that reduce concurrency. However, having to query results in the same order as the commands were issued sounds like a sensible thing to do to me. Out-of-order queries might even be detectable at run-time by the debugger.

References

[1] Bertrand Meyer: "Reflexivity, and other pillars of civilization"

[2] Gary T. Leavens: "Not a Number of Floating Point Problems"