Thread-Safe ACIS (Part 2)

From DocR23

Jump to: navigation, search

Contents

How to use TS-ACIS

Thread Manager

We recommend the use of the ACIS thread manager for numerous reasons. It is simple to use, easily understood, and tailored to ACIS. Other thread management schemes are supported but they may hinder collaborative efforts and complicate the process of reporting incidents.

The ACIS thread manager implements a producer/consumer queue with the following behaviors. Only the master thread can add work to the queue. The master thread only produces work for the queue when worker threads are available, otherwise the master consumes as it produces, which restores serial behavior. Also, work is never overproduced. In other words, only enough work is produced to occupy available worker threads.

The thread manager has one pure virtual function called process. Implement this method in derived classes that perform the work and produce the results for the operation. This is the only method that executes concurrently when worker threads exist. The thread manager run method, which is called only by the master thread, will schedule the process method to be called by worker threads and returns. It directly calls the process method when worker threads are not available.

Thread Manager Implementation Steps

It is very straightforward to add parallel regions to algorithms using the ACIS thread manager. The steps are:

  • Identify the potential parallel operation and associated data.
  • Encapsulate the independent inputs and outputs of each operation into a data structure (work packet).
  • Develop a class with a method that processes each work packet.
  • Add a method that performs the independent operation on work packets.
  • Add a method to combine all the outputs from the work packets.
  • Derive the new class from the thread manager base class: thread_work_base.
  • Ensure the correctness of the operations without threads.
  • Ensure the correctness of the operations with threads.

Thread Manager Example

The following is a trivial example of using the thread manager. The code adds all the whole numbers from one to one million. The serial code is shown below:

int run()
{
    int answer = 0;
    for ( int i = 1; i < 1000000; i++ )
    {
	answer += i;
    }
 
    return answer;
}

To modify this example to use the ACIS thread manager we must first decide how to distribute the work amongst available threads. The type of work in this example, an addition operation, has equal complexity and distribution. It therefore lends itself well to either cyclic or block scheduling. We will chose cyclic scheduling, where each thread begins at a unique point and increments to its next value by adding the number of threads involved in the calculation. This works well in this situation because it doesn't matter in which order the numbers are added and it requires the least amount of information to be passed to each thread.

We begin by deriving a custom class from the thread manager base class thread_work_base. To this we add an array of integers to hold the thread specific totals. This will give each thread an independent location to store its data where it is easily accessible from the master thread. We have sized this array to the maximum number of threads that ACIS can utilize using the MAX_ACTIVE_THREADS symbol, which is currently set to 1024.

We also add an integer to hold the number of threads used in the calculation. This makes sense because we only want to calculate it once as it is then essentially a read-only value. We must consider that the master thread will not do any of the work because it only produces work when worker threads are available. In this case we must subtract one from the total. Note that the value will be one when no workers are available (because the master thread will do all the work) and when only one worker is available (because the one worker will do all the work).

The run method of the derived class, which by the way does not have to be called run, has the same external behavior as the original function in that it sums all the numbers from one to one million and returns the answer. The fact that we use threads to compute the answer is hidden in the implementation. The method first calculates the number of threads to use, then schedules work (puts it in the queue) by calling the base class run method; passing along the starting index for each thread. It then waits for all threads to complete the operation, adds the computed values together, and returns the answer.

The process method of the derived class, which is called by worker threads when the thread management system schedules the work in the queue, receives the starting index passed through the run method as input. This represents the first of the numbers to be added together. We initialize the value where the answers are placed to zero, then add together all numbers of the set each thread is responsible for by incrementing our step by the number of worker threads.

The example below demonstrates cyclic scheduling. As an alternative we could have used block scheduling, where each thread works on an equally sized and contiguous data set. In this case we would have divided the one million numbers by the number of threads, letting each sum the contiguous values from some staring point to some stopping point. The code would not have been much different. The pros and cons of choosing one over the other are discussed below.

class my_thread_work : public thread_work_base
{
    int answers[MAX_ACTIVE_THREADS];	
    int num_threads;				
 
protected:
 
    void process( void* arg )
    {
	int index = (int)arg;
	answers[index] = 0;
	for ( int i = index; i < 1000000; i += num_threads )
	{
	    answers[index] += i;
	}
    }
 
public:
 
    int run()
    {
	num_threads = thread_count() == 1 ? 1 : thread_count() - 1;
 
	int i;
	for ( i = 0; i < num_threads; i++ )
	{
	    thread_work_base::run( (void*)i );		
	}
 
	thread_work_base::sync();
 
	for ( i = 1; i < num_threads; i++ )
	{
	    answers[0] += answers[i];
	}
 
	return answers[0];
    }
};

This trivial example is valuable because it demonstrates common practices one encounters when developing parallel code regions. These are: dividing work, processing work, and aggregating results. The thread manager simplifies the process by providing structure. Roles and responsibilities are clear: only the master thread can produce work and worker threads are only consumers. Behavior is easily understood: the code works with and without worker threads and the process method is the only code that is executed concurrently.

The drawback of this example is that it does not demonstrate the benefits of scaling simply because it does not perform the operation faster with threads. This is because the operation itself is not time consuming enough to warrant the added overhead of thread management. This is an example of fine-grain parallelism that requires bare-bones thread management tactics with little overhead. The ACIS thread manager is designed to be used for coarse-grain parallelism, where thread management overhead is minuscule in relation to the operation.

Thread-Local Storage

ACIS uses internally developed template objects called safe types to implement thread-local storage (TLS). Safe types exist for all basic data types and provide thread specific values for each variable. Safe types must be global in scope and must have been initialized (constructed) before the ACIS modeler is started (or the base is initialized).

The safe type values are initialized to the value assigned during static construction. The master thread's value is stored in a "current value" (currval) variable within the safe type object. (This enhances performance in non-threaded code regions and is useful when debugging.) Thread specific values are located in storage outside of the object and are therefore not directly visible.

Thread specific values are fetched only when safe types are accessed within thread safe regions. These regions are defined by calling the functions thread_safe_region_begin and thread_safe_region_end. The thread manager automatically defines thread safe regions in its run method. The use of regions eliminates the overhead of TLS accesses when they are not needed.

ACIS supports the following safe types:

Name Use
safe_integral_type int, short, char, signed or unsigned
safe_floating_type float or double
safe_pointer_type Pointer to non-class data
safe_object_pointer Pointer to a class object
safe_function_type Pointer to a function

Safe type Example

The following is an example use of thread-local storage with safe types. We use the thread manager to schedule a large number of concurrent calls to the process method of a custom thread work class. The process method, in the first pass, increments the thread specific value of the safe type. In the second pass each thread prints out its accumulated value.

safe_integral_type<int> thread_local_integer;
 
class my_thread_work : public thread_work_base
{
 
protected:
 
    void process( void* arg )
    {
	if ( (int)arg == 0 )
	{
	    thread_local_integer++;
	}
	else
	{
	    printf( "Thread %d scheduled %d times\n", 
		thread_id(), (int)thread_local_integer );
	}
    }
 
public:
 
    void run()
    {
	int i;
	for ( i = 0; i < 1000000; i++ )
	{
	    thread_work_base::run( (void*)0 );		
	}
	thread_work_base::sync();
	for ( i = 0; i < thread_count() - 1; i++ )
	{
	    thread_work_base::run( (void*)1 );		
	}
	thread_work_base::sync();
    }
};

Several aspects of this example are worth pointing out: the safe type is defined at a global scope as required and the safe type supports the operations of the basic type it encapsulates. The former facilitates in proper and timely initialization and storage allocation, the later eliminates the need for special syntax when using safe types.

In fact, safe types can be used just as native types with one known exception. One must explicitly cast safe types in non-type-specific operations, such as when used with printf as in the example above. printf is a variadic function, in that it supports a variable number of arguments. The arguments furthermore are not accessed in a type specific manner, which eliminates the automatic call to the type operator in an object. Hence the need for an explicit cast.

Also worth mentioning is the non-deterministic behavior of this example as shown below. Each thread will most likely print out different count values and the order of the output will be different every time the program is executed. This is largely affected by the Operating System scheduler and is an expected and accepted consequence of multi-threading.

Example output:

Thread 2 scheduled 125008 times
Thread 8 scheduled 125000 times
Thread 3 scheduled 124999 times
Thread 1 scheduled 124982 times
Thread 7 scheduled 125005 times
Thread 4 scheduled 125001 times
Thread 5 scheduled 125001 times
Thread 6 scheduled 125004 times
Press any key to continue . . .

Mutual Exclusion Logic (Mutex)

ACIS uses class objects to implement mutual exclusion logic. The base class, called mutex_resource, encapsulates the OS specific mutex implementation (i.e., the resource). The Windows implementation uses the CRITICAL_SECTION runtime object, which yields CPU utilization of threads waiting to acquire the mutex resource.

The mutex_resource class has methods to acquire and release the mutex resource directly, but is more commonly used in conjunction with another ACIS class object called mutex_object. This class object calls the mutex_resource acquire method when constructed and calls the mutex_resource release method when destructed. This helps assure that the mutex resource is properly released.

Finally, the ACIS CRITICAL_BLOCK macro helps simplify the use of mutexes by automating the construction of mutex_object objects with the proper coupling to mutex_resource objects. This is the preferred method of implementing mutex blocks.

Mutex Example

The following example uses ACIS mutex objects. We use the thread manager to schedule a large number of concurrent calls to the process method of a custom thread work class. The process method simply increments a global integer. Since this is not an atomic operation, we must use a mutex to assure that the operation is not performed concurrently.

mutex_resource my_mutex;
 
int global_integer;
 
class my_thread_work : public thread_work_base
{
 
protected:
 
    void process( void* arg )
    {
	CRITICAL_BLOCK( my_mutex );
	global_integer++;
    }
 
public:
 
    void run()
    {
	int i;
	for ( i = 0; i < 1000000; i++ )
	{
	    thread_work_base::run( (void*)0 );		
	}
	thread_work_base::sync();
	printf( "Accumulated value: %d\n", global_integer );
 
    }
};

This code would create a race condition without the mutex because multiple threads would read from and write to the variable concurrently in a non-deterministic manner. The program outputs different values without the mutex but always outputs the same and correct value with the mutex in place. We use the mutex to force an atomic operation and restore determinacy.

The performance impact of the mutex in this case will be significant because the overhead of acquiring and releasing it is certainly more costly than the addition operation it protects. In other words, we have extreme contention for a trivial operation. It is advantageous to use an alternate approach that avoids synchronization, as in earlier examples where threads increment unique variables.

Multi-Model Operations

TS-ACIS supports concurrent operations on independent data. We define independent data as a collection of entities belonging to a particular history stream, without connectivity to entities in other streams. In ACIS this independent data represents a model. Working with a single model is called part modeling and working with multiple models is typically called assembly modeling. (In this context assembly modeling does not refer to ACIS assembly management functionality.)

Applications that support assembly modeling are good candidates for TS-ACIS because they probably have workflows that are conducive to multi-threading. Loading and viewing assemblies, for example, may be a very common operation. This involves restoring and faceting groups of ACIS models, which is parallelizable.

Multi-Model Example

The following example uses the ACIS thread manager to load and facet models of an assembly concurrently. Much of the implementation is omitted for simplification. The assembly_info class contains a pointer to a linked list of model_info objects. The model_info class contains a pointer to a target history stream, a handle to the model file, and a list to hold top-level entities.

class my_thread_work : public thread_work_base
{
    assembly_info* assembly;
 
protected:
 
    void process( void* arg)
    {
	model_info* model = (model_info*)arg;
	api_set_default_history( model->stream() );
	api_restore_entity_list( model->file(), ...
	api_facet_entity( model->entity() );
    }
 
public:
 
    void run()
    {
	model_info* mi = assembly->first_model();
	while (mi)
	{
	    thread_work_base::run( (void*)mi );		
	    mi = mi->next_model();
	}
	thread_work_base::sync();
    }
};

We implement a run method to iterate through all models of an assembly. The master thread passes the addresses of the model_info objects to the thread manager base class run method, which schedules the work, and then waits for the work to complete by calling the sync method.

The process method, which executes concurrently when worker threads are available, casts its input argument to a model_info object, sets its default history stream to that of the model, restores the model, and facets it.

The model_info class design encapsulates all information required to perform the operation in isolation. We associate a history stream with the model so that other threads can in turn work on the model without mixing history data (mixing streams). Setting the default history stream to that of the model (i.e., activating the stream) is the very first thing each thread must do before working on a particular model.

There may be other model specific information, in addition to the history stream, to consider in scheduled operations. The unit representations, for example, may be different. One model might be modeled in millimeters while another is modeled in inches. This must be accounted for. ACIS provides a modeler_state helper object that currently contains a snapshot of the tolerance variables and option_header values. Use this in conjunction with a custom implementation to represent the true state of a model.

Single-Model Operations

Employing threads in single-model operations is more complex than in multi-model operations because the model data is not independent. Each thread must make a deep copy of the portion of the model it requires to perform the desired operation. The benefits of parallelization therefore must outweigh the expense of copying the data.

One must also consider whether the results of the operation are to be reflected in the original model. Most output data, from evaluation operations anyway, is independent of entity references. A point-in-face operation for example, inputs a face and a position and outputs containment information, which is an enumerated type. This simple output data can easily be stored in the output data object (work packet) for future use.

An entity-point-distance operation on the other hand, inputs an entity and a position and outputs closest point information, containing a pointer to an entity that may be different from the input entity. This type of output data is useless to the master thread when the temporary entity copies are lost. Care must therefore be taken to map the entity of the copy to the corresponding entity of the original.

To illustrate this entity mapping let us consider a scenario where we have deep copied a face for an entity-point-distance operation and receive an edge pointer in the output data. To map the copied edge belonging to the copied face to the original edge belonging to the input face, we can rely on the ordering of the ENTITY_LISTs returned by calling api_get_edges on both the copy face and original face. The mapping is then reduced to finding the index of the edge in the copy list and looking up the corresponding edge in the original list.

Other operations may require entity modification along with history data from worker threads to be merged into the master thread's history stream. This requires many restrictions in order to maintain consistent history information. For example, only one set of changes per original entity are permitted, otherwise the merges cannot be reconciled. Use merge_child_state to merge the worker thread's history stream with the master thread's history stream.

Single-Model Example

The following example uses the ACIS thread manager to facet the faces of an input body concurrently. It demonstrates the use of deep copy to create independent model data, how to use a mutex to avoid race conditions when modifying shared model data, and merging child history stream data into the master stream. As before, much of the implementation is omitted for simplification.

class facet_body_thread_work : public thread_work_base
{
    struct face_info
    {
	FACE* input_face;
	HISTORY_STREAM* stream;
	int status;
    };
 
    face_info* face_info_array;
    mutex_resource mutex;
 
protected:
 
    void process( void* arg )
    {
	face_info& info = face_info_array[(int)arg];
	HISTORY_STREAM* current_stream;
	api_get_default_history( current_stream );
	api_set_default_history( info.stream );
	info.stream->set_distribute_flag( TRUE );
	API_BEGIN
	    FACE* input_face = info.input_face;
	    FACE* copy_face = NULL;
	    api_deep_down_copy_entity( input_face, (ENTITY*&)copy_face, TRUE );
	    api_facet_entity( copy_face );
	    ATTRIB* att = copy_face->attrib();
	    {
		CRITICAL_BLOCK( mutex );
		att->move( input_face );
	    }
	    api_del_entity( copy_face );
	API_END
	info.status = result.error_number();
	api_set_default_history( current_stream );
    }
 
public:
 
    void facet_body_with_threads( BODY* body )
    {
	ENTITY_LIST face_list;
	api_get_faces( body, face_list);
	int face_count = face_list.count();
	face_info_array = ACIS_NEW face_info[face_count];
	int i;
	for ( i = 0; i < face_count; i++ )
	{
	    face_info_array[i].input_face = (FACE*)face_list[i];
	    api_create_history( face_info_array[i].stream );
	    thread_work_base::run( (void*)i );
	}
	thread_work_base::sync();
	for ( i = 0; i < face_count; i++ )
	{
	    if (face_info_array[i].status == 0)
		merge_child_state( face_info_array[i].stream );
	    api_delete_history( face_info_array[i].stream );
	}
	ACIS_DELETE [] face_info_array;
    }
};

The run method, in this case called facet_body_with_threads, accepts an input body, creates storage to hold the data for the scheduled operations (work packets), then schedules the work and waits for it to complete, and finally merges all entity modifications into the masters history stream.

The process method sets the appropriate history stream, deep copies the input face, facets the copied face, moves the facet attribute from the copy to the input face, and deletes the copied face.

Important details to notice are setting the distribution flag on the target stream to TRUE, which allows the stream to log entity changes from multiple streams (mixed streams), and using a mutex via a CRITICAL_BLOCK to remove concurrency while moving the facet data attribute from the copy face to the input face.

Initializing the Thread Manager Example

The following example demonstrates the initialization of the thread manager and additionally incorporates the use of the code that adds all numbers up to one million as shown earlier.

int start_acis() 
{
    return api_start_modeller(0).ok();
}
 
int stop_acis()
{
    return api_stop_modeller().ok();
}
 
int main()
{
    int num_threads = 2;
 
    start_acis();
 
    unlock_license();
 
    thread_work_base::initialize( num_threads, start_acis, stop_acis );
 
    my_thread_work thread_work;
 
    int total = thread_work.run();
 
    printf( "Adding all numbers equals %d\n", total );
 
    thread_work_base::terminate();
 
    stop_acis();
 
    return 0;
}

The master thread starts the modeler, unlocks the ACIS license, and initializes the thread manager with the number of threads to use, as well as initialization and termination routines. Each thread created calls the initialize routine and then waits to take work from the queue. The master thread then constructs a thread work object and calls its run method. This schedules the work to be processed by available threads by placing it in the queue. The master thread then waits for the work to complete, prints out the calculated answer, terminates the thread manager, thereby terminating all worker threads, and stops the modeler.

Personal tools