Common features
The following features are common to shared-memory parallelism and distributed-memory parallelism.
Redundant work
No matter how concurrent an algorithm is, there are still parts that will have to be carried out serially. When the results of those serial computations must be shared, you can perform the computation in one process and broadcast the results out to the other processes that require the data. Sending this information will add overhead to the execution time. On the other hand, if the data used in the computation is already available to other processes, you can have each process perform the computation and generate results locally without the need to send any messages. In shared-memory parallelism, the data for computation is likely already available by default. Even though doing redundant work in threads keeps processing resources busy and eliminates extraneous synchronization, there is a cost in memory space needed to hold multiple copies of the same value.
Dividing work
Work must be assigned to threads and process alike. This may be done by assigning a chunk of the data and having each thread/process execute the same computations on the assigned block, or it may be some method of assigning a computation that involves executing a different portion of the code within the application.
Sharing data
There will be times when applications must share data. It may be the value of a counter or a vector of floating-point values or a list of graph vertices. Whatever it happens to be, threads and processes alike will need to it during the course of the computation. Obviously, the methods of sharing data will vary; shared-memory programs simply access a designated location in memory, while distributed-memory programs must actively send and receive the data to be shared.
Static/dynamic allocation of work
Depending on the nature of the serial algorithm, the resulting concurrent version, and the number of threads/processes, you may assign all the work at one time (typically at the outset of the computation) or over time as the code executes. The former method is known as a static allocation since the original assignments do not change once they have been made. The latter is known as dynamic allocation since work is doled out when it is needed. Under dynamic allocation, you may find that the same threads do not execute the same pieces of work from one run to another, while static allocation will always assign the same work to the same threads (if the number of threads is the same) each and every time.
Typically, if the work can be broken up into a number of parts that is equal to the number of threads/processes, and the execution time is roughly the same for each of those parts, a static allocation is best. Static allocation of work is always the simplest code to implement and to maintain. Dynamic allocation is useful for cases when there are many more pieces of work than threads and the amount of execution time for each piece is different or even unknown at the outset of computation. There will be some overhead associated with dynamic allocation scheme, but the benefits will be a more load-balanced execution.
Features Unique to Shared Memory
These next few items are where distributed-memory and shared-memory programming differ. If you’re familiar with distributed-memory parallelism, you should be able to see the differences. For those readers not familiar with distributed-memory parallelism, these points and ideas are still going to be important to understand.
Local declarations and thread-local storage
Since everything is shared in shared-memory, there are times it will be useful to have a private or local variable that is accessed by only one thread. Once threads have been spawned, any declarations within the path of code execution (e.g. declarations within function calls) will be automatically allocated as local to the thread executing the declarative code. Processes executing on a node within a distributed-memory machine will have all local memory within the node.
A thread-local storage (TLS) API is available in Windows threads and POSIX threads. Though the syntax is different in the different threaded libraries, the API allocates some memory per executing thread and allows the thread to store and retrieve a value that is accessible to only that thread. The difference between TLS and local declarations is that the TLS values will persist from one function call to another. This is much like static variable, except that in TLS, each thread gets an individually addressable copy.
Memory effects
Since threads are sharing the memory available to the cores on which they are executing, there can be performance issues due to that sharing. I’ve already mentioned storage conflicts and data races. Processor architecture will determine if threads share or have access to seperate caches. Sharing caches between two cores can effectively cut in half the size of the cache available to a thread, while separate caches can make sharing of common data less efficient. On the good side, sharing caches with commonly accessed, read-only data can be very effective, since only a single copy is needed.
False sharing is a situation where threads are not accessing the same variables, but they are sharing a cache line that contains different variables. Due to cache coherency protocols, when one thread updates a variable int the cache line and another thread wants to access something else in the same line, that line is first written back to memory. When two or more threads are repeatedly updating the same cache line, especially from seperate caches, that cache line can bounce back and forth through memory for each update.
Communication in memory
Distributed-memory programs share data by sending and receiving messages between processes. In order to share data within shared memory, one thread simply writes a value into a memory location and the other thread reads the value out of that memory location. Of course, to ensure that the data is transferred correctly, the writing thread must deposit the value to be shared into memory before the reading thread examines the location. Thus, the threads must synchronize the order of writing and reading between the threads. The send-receive exchange is an implicit synchronization between distributed processes.
Mutual exclusion
In order to communicate in memory, threads must sometimes protect access to shared memory locations. The means for doing this is to allow only one thread at a time to have access to shared variables. This is known as mutual exclusion. Several different synchronization mechanisms are available (usually dependent on the threading method you are using) to provide mutual exclusion.
Both reading and writing of data must be protected. Multiple threads reading the same data won’t cause any problem. When you have multiple threads writing to the location, the order of the updates to the memory location will determine the value that ultimately stored there and the value that will be read out of the location by another thread. When you have one thread reading and one thread writing to the same memory location, the value that is being read can be one of two values (the old value or the new value). It is likely that only one of those will be the expected value, since the original serial code expects only one value to be possible. If the correct execution of your threaded algorithm depends on getting a specific value from a variable that is being updated by multiple threads, you must have logic that guarantees the right value is written at the correct time; this will involve mutual exclusion and other synchronization.
Producer/consumer
One algorithmic method you can use to distribute data or task to the processes in distributed-memory programs is boss/worker. Worker processes send a message to the boss process requesting a new task; upon receipt of the request, the boss sends back a message/task/data to the worker process. You can write a boss/worker task distribution mechanism in threads, but it requires a lot of synchronization.
To take advantage of the shared memory protocols, you can use a variation of boss/worker that uses a shared queue to distribute tasks. This method is known as producer/consumer. The producer thread creates encapsulated tasks and stores them into the shared queue. The consumer threads pull out tasks from the queue when they need more work. You must protect access to the shared queue with some form of mutual exclusion in order to ensure that tasks being inserted into the queue are placed correctly and tasks being removed from the queue are assigned to a single thread only.
Readers/writer locks
Since it is not a problem to have multiple threads reading the same shared variables, using mutual exclusion to prevent multiple reader threads can create a performance bottleneck. However, if there is any chance that another thread could update the shared variable, mutual exclusion must be used. For situations where shared variables are to be updated much less frequently than they are to be read, a reader/writer lock would be the appropriate synchronization object.
Readers/writer locks allow multiple reader threads to enter the protected area of code accessing the shared variable. Whenever a thread wishes to update (write) the value to the shared variable, the lock will ensure that any prior reader threads have finished before allowing the single writer to make the updates. When any writer thread has been allowed access to the shared variable by the readers/writer lock, new readers or other threads wanting write access are prevented from proceeding until the current writing thread has finished.