OpenMP provides a straight-forward interface to write software that can use multiple cores of a computer. Using OpenMP you can write code that uses all of the cores in a multicore computer, and that will run faster as more cores become available.
OpenMP is a well-established, standard method of writing parallel programs. It was first released in 1997, and is currently on version 3.0. It is provided by default in nearly all compilers, e.g. the gnu compiler suite (gcc, g++, gfortran), the Intel compilers (icc, icpc, ifort) and the Portland Group compilers (pgc, pgCC, pgf77) and works on nearly all operating systems (e.g. Linux, Windows and OS X).
You can read about the history of OpenMP at its Wikipedia page, or by going to one of the many OpenMP websites. The best book to learn OpenMP in my opinion is Using OpenMP: Portable Shared Memory Parallel Programming (I used it to learn OpenMP).
OpenMP can be combined with other parallel programming technologies, e.g. MPI. This course is presented as a companion to my MPI course, with both this and the MPI course following a similar structure and presenting similar examples. If you want to compare OpenMP and MPI, then please click on the Compare with MPI links on each page.
This is a short course that will provide a taster of OpenMP. The course is arranged with examples in C, C++ and Fortran. Feel free to look at just the language you are familiar with, or, if you are interested, feel free to compare OpenMP use in each of the languages. Please also feel free to work through this course at your own pace. OpenMP is best learned by using it, so please copy out and play with the examples provided, and also have a go at the exercises.
The first stage is to write a small OpenMP program. Choose from C, C++ or Fortran to write the program hello_openmp. Once you have written and compiled hello_openmp return to this page.
Now that you have the compiled the hello_openmp executable, the next step is to run it. The way to run the executable is the same regardless of the programming language used to produce it. The first step is to set the number of threads of execution that you want to run in parallel. This is achieved by setting the environmental variable OMP_NUM_THREADS to the desired number of threads, e.g. in the bash shell;
export OMP_NUM_THREADS=4
or in the tcsh shell;
setenv OMP_NUM_THREADS 4
Now that you have set the number of threads, run the executable by typing;
./hello_openmp
You should see the following output.
Hello OpenMP! Hello OpenMP! Hello OpenMP! Hello OpenMP!
The line Hello OpenMP! is output four times, as the program split into four threads, and each thread printed Hello OpenMP!. Play with the number of threads (by changing OMP_NUM_THREADS) and see how that affects the output.
Most C compilers support the use of OpenMP. Available compilers include gcc (version 4.2 or above), icc and pgc.
The first step is to create a simple OpenMP C program, which we will call hello_openmp. Open a text editor (e.g. pico), create a file called hello_openmp.c and copy in the following code;
#include <stdio.h> int main(int argc, char **argv) { #pragma omp parallel { printf("Hello OpenMP!\n"); } return 0; }
The only new line in this example is #pragma omp parallel, which is used to specify that all of the code between the curly brackets is part of an OpenMP parallel section.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, hello_openmp.
Now return to the previous page to see how to run hello_openmp in parallel.
Most C++ compilers support the use of OpenMP. Available compilers include g++ (version 4.2 or above), icpc and pgCC.
The first step is to create a simple OpenMP C++ program, which we will call hello_openmp. Open a text editor (e.g. pico), create a file called hello_openmp.cpp and copy in the following code;
#include <iostream> int main(int argc, const char **argv) { #pragma omp parallel { std::cout << "Hello OpenMP!\n"; } return 0; }
The only new line in this example is #pragma omp parallel, which is used to specify that all of the code between the curly brackets is part of an OpenMP parallel section.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, hello_openmp.
Now return to the previous page to see how to run hello_openmp in parallel.
Most Fortran compilers support the use of OpenMP. Available compilers include gfortran (version 4.2 or above), ifort, and pgf77/90.
Note that OpenMP is available for all flavours of Fortran (e.g. Fortran 77, Fortran 90, Fortran 2008), and is used in the same way in each of these flavours. In this course, all of the examples presented will be standard Fortran 77.
The first step is to create a simple OpenMP Fortran program, which we will call hello_openmp. Open a text editor (e.g. pico), create a file called hello_openmp.f and copy in the following code;
program main implicit none C$OMP PARALLEL print *,"Hello OpenMP!" C$OMP END PARALLEL end
The only new lines in this example is C$OMP PARALLEL and C$OMP END PARALLEL, which is used to specify that all of the code between these two lines is part of an OpenMP parallel section.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, hello_openmp.
Now return to the previous page to see how to run hello_openmp in parallel.
So what was going on in the last example?
A standard program works by executing one line of code at a time, starting from the main function (or "program" function in Fortran) and working down line by line. This single thread of execution is known as the "main" thread. All programs have a single main thread, and in most programs, this is the only thread of execution, hence why the program can only do one thing at a time.
The hello_openmp also has a single main thread of execution. However, this main thread is split into a team of threads within the OpenMP parallel section. Each parallel thread in the team executes all of the code in the parallel section, hence each thread executes the line of code that prints Hello OpenMP!.
We can see this more clearly by getting each thread to identify itself. Please copy the code from one of these examples to create the executable hello_threads.
Try running this executable using different values of OMP_NUM_THREADS.
The OpenMP parallel section is specified by using compiler directives. These directives (also called compiler pragmas) are instructions to the compiler to tell it how to create the team of threads, and to help tell the compiler how to assign threads to tasks. These OpenMP directives are only followed if the program is compiled with OpenMP support. If the program is compiled without OpenMP support, then they are ignored.
There are several OpenMP directives. This course will cover the basic usage of just a selection;
Pragmas are added to the code using either;
#pragma omp name_of_directive
in C or C++, or using;
C$OMP name_of_directive
at the start of the line in fixed-format Fortran, or using;
!$OMP name_of_directive
anywhere in the line using free-format Fortran.
Copy this into the file hello_threads.c
#include <stdio.h> #ifdef _OPENMP #include <omp.h> #else #define omp_get_num_threads() 0 #define omp_get_thread_num() 0 #endif int main(int argc, char **argv) { int nthreads, thread_id; printf("I am the main thread.\n"); #pragma omp parallel private(nthreads, thread_id) { nthreads = omp_get_num_threads(); thread_id = omp_get_thread_num(); printf("Hello. I am thread %d out of a team of %d\n", thread_id, nthreads); } printf("Here I am, back to the main thread.\n"); return 0; }
This example uses two OpenMP functions;
Note that using these functions requires you to include the omp.h header file. To ensure portability (if OpenMP is not supported) we hide this header file behind an #ifdef _OPENMP guard, and add stubs for the two OpenMP functions to set them to 0.
This example uses a slightly modified omp parallel line. In this case, private(nthreads, thread_id) is added to specify that each thread should have its own copy of the nthreads and thread_id variables.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, hello_threads.
Now return to the previous page.
Copy this into the file hello_threads.cpp
#include <iostream> #ifdef _OPENMP #include <omp.h> #else #define omp_get_num_threads() 0 #define omp_get_thread_num() 0 #endif int main(int argc, const char **argv) { std::cout << "I am the main thread.\n"; #pragma omp parallel { int nthreads = omp_get_num_threads(); int thread_id = omp_get_thread_num(); std::cout << "Hello. I am thread " << thread_id << " out of a team of " << nthreads << std::endl; } std::cout << "Here I am, back to the main thread.\n"; return 0; }
This example uses two OpenMP functions;
Note that using these functions requires you to include the omp.h header file. To ensure portability (if OpenMP is not supported) we hide this header file behind an #ifdef _OPENMP guard, and add stubs for the two OpenMP functions to set them to 0.
Note also that because the nthreads and thread_id variables exist only in the scope of the omp parallel section, each thread will have its own private copy of these variables.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, hello_threads.
Now return to the previous page.
Copy this into the file hello_threads.F (make sure you use a capital .F)
program main #ifdef _OPENMP use omp_lib #endif implicit none integer thread_id integer num_threads print *,"I am the main thread." C$OMP PARALLEL PRIVATE(thread_id, num_threads) #ifdef _OPENMP thread_id = omp_get_thread_num() num_threads = omp_get_num_threads() #else thread_id = 0 num_threads = 1 #endif print *,"Hello. I am thread ",thread_id," out of a team of ", . num_threads C$OMP END PARALLEL print *,"Here I am, back to the main thread." end
This example uses two OpenMP functions;
Note that using these functions requires you to use the omp_lib Fortran library. To ensure portability (if OpenMP is not supported) we hide this header file behind an #ifdef _OPENMP guard, and manually set the values of num_threads and thread_id if _OPENMP is not set.
This example uses a slightly modified omp parallel line. In this case, private(nthreads, thread_id) is added to specify that each thread should have its own copy of the nthreads and thread_id variables.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, hello_threads.
Now return to the previous page.
The OpenMP sections directive provides a means by which different threads can run different parts of the program in parallel. Copy the code from the appropriate link below to create the executable omp_sections.
Try running the executable omp_sections using different values of OMP_NUM_THREADS.
In this example, there are three functions, times_table, countdown and long_loop. These three functions are called from within an OpenMP sections directive, where each function is placed into a separate OpenMP section. This tells the compiler that these three functions can be called in parallel, and a thread from the team can be assigned to each section. Note that if there are more sections than threads, then each section is queued until a thread is available to run it, while if there are more threads than sections, then the extra threads will have nothing to do. Note that there is no guarantee as to the order in which sections are executed.
Create the file omp_sections.c and copy in the following;
#include <stdio.h> #include <unistd.h> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif void times_table(int n) { int i, i_times_n, thread_id; thread_id = omp_get_thread_num(); for (i=1; i<=n; ++i) { i_times_n = i * n; printf("Thread %d says %d times %d equals %d.\n", thread_id, i, n, i_times_n ); sleep(1); } } void countdown() { int i, thread_id; thread_id = omp_get_thread_num(); for (i=10; i>=1; --i) { printf("Thread %d says %d...\n", thread_id, i); sleep(1); } printf("Thread %d says \"Lift off!\"\n", thread_id); } void long_loop() { int i, thread_id; double sum = 0; thread_id = omp_get_thread_num(); for (i=1; i<=10; ++i) { sum += (i*i); sleep(1); } printf("Thread %d says the sum of the long loop is %f\n", thread_id, sum); } int main(int argc, char **argv) { printf("This is the main thread.\n"); #pragma omp parallel { #pragma omp sections { #pragma omp section { times_table(12); } #pragma omp section { countdown(); } #pragma omp section { long_loop(); } } } printf("Back to the main thread. Goodbye!\n"); return 0; }
Note that we have included the header file unistd.h to use the function sleep, to add one second pauses (via sleep(1)) within each function.
In this example, the omp sections specifies a block of sections that may be run in parallel, with each individual section specified within each omp section block. While it is possible to write the code within each omp section block directly, the code is more readable if you write each section as a function (e.g. countdown, long_loop and times_table) and just call the function from within each section.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, sections.
Now return to the previous page.
Create the file omp_sections.cpp and copy in the following;
#include <iostream> #include <unistd.h> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif void times_table(int n) { int thread_id = omp_get_thread_num(); for (int i=1; i<=n; ++i) { int i_times_n = i * n; std::cout << "Thread " << thread_id << " says " << i << " times " << n << " equals " << i_times_n << std::endl; sleep(1); } } void countdown() { int thread_id = omp_get_thread_num(); for (int i=10; i>=1; --i) { std::cout << "Thread " << thread_id << " says " << i << "...\n"; sleep(1); } std::cout << "Thread " << thread_id << " says \"Lift off!\"\n"; } void long_loop() { double sum = 0; int thread_id = omp_get_thread_num(); for (int i=1; i<=10; ++i) { sum += (i*i); sleep(1); } std::cout << "Thread " << thread_id << " says the sum of the long loop is " << sum << std::endl; } int main(int argc, const char **argv) { std::cout << "This is the main thread.\n"; #pragma omp parallel { #pragma omp sections { #pragma omp section { times_table(12); } #pragma omp section { countdown(); } #pragma omp section { long_loop(); } } } std::cout << "Back to the main thread. Goodbye!\n"; return 0; }
Note that we have included the header file unistd.h to use the function sleep, to add one second pauses (via sleep(1)) within each function.
In this example, the omp sections specifies a block of sections that may be run in parallel, with each individual section specified within each omp section block. While it is possible to write the code within each omp section block directly, the code is more readable if you write each section as a function (e.g. countdown, long_loop and times_table) and just call the function from within each section.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, omp_sections.
Now return to the previous page.
Create the file omp_sections.F and copy in the following;
subroutine times_table(n) use omp_lib implicit none integer i, n, i_times_n integer thread_id thread_id = omp_get_thread_num() do 100 i=1,n i_times_n = i * n print *,"Thread ",thread_id," says ", . i," times ",n," equals ",i_times_n call sleep(1) 100 continue end subroutine countdown use omp_lib implicit none integer i, thread_id thread_id = omp_get_thread_num() do 200 i=10,1,-1 print *,"Thread ",thread_id," says ",i,"..." call sleep(1) 200 continue print *,"Thread ",thread_id," says Blast off!" end subroutine long_loop use omp_lib implicit none integer i, thread_id double precision sum thread_id = omp_get_thread_num() sum = 0 do 300 i=1,10 sum = sum + (i*i) call sleep(1) 300 continue print *,"Thread ",thread_id," says the sum of the long ", . "loop equals ",sum end program main implicit none C$OMP PARALLEL C$OMP SECTIONS C$OMP SECTION call times_table(12) C$OMP SECTION call countdown() C$OMP SECTION call long_loop() C$OMP END SECTIONS C$OMP END PARALLEL end
Note that this example calls the sleep subroutine to add one second pauses (via call sleep(1)) within each function.
In this example, the code from OMP SECTIONS to OMP END SECTIONS specifies a block of sections that may be run in parallel, with each individual section specified using OMP SECTION. While it is possible to write the code within each OMP SECTION block directly, the code is more readable if you write each section as a function (e.g. countdown, long_loop and times_table) and just call the function from within each section.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, sections.
Now return to the previous page.
OpenMP sections provide a method by which you can assign different functions to be run by different threads. While this is easy to do, it does not scale well. You can only run as many threads in parallel as there are different functions to be run. If you have more threads than functions, then the extra threads will be idle. Also, if different functions take different amounts of time, then some threads may finish earlier than other threads, and they will be left idle waiting for the remaining threads to finish.
One way of achieving better performance is to use OpenMP to parallelise loops within your code. Lets imagine you have a loop that requires 1000 iterations. If you have two threads in the OpenMP team, then it would make sense for one thread to perform 500 of the 1000 iterations while the other thread performs the other 500 of 1000 iterations. This will scale as as more threads are added, the iterations of the loop can be shared evenly between them, e.g.
Of course, this only scales up to the number of iterations in the loop, e.g. if there are 1500 threads, then 1000 threads will have 1 iteration each, while 500 threads will sit idle.
Also, and this is quite important, this will only work if each iteration of the loop is independent. This means that it should be possible to run each iteration in the loop in any order, and that each iteration does not affect any other iteration. This is necessary as running loop iterations in parallel means that we cannot guarantee that loop iteration 99 will be performed before loop iteration 100.
To see how to run an OpenMP parallel loop, copy out the appropriate code from the links below to create the executable loops;
Again, run this executable with different values of OMP_NUM_THREADS to get an idea of how it performs.
Copy this into loops.c
#include <stdio.h> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif int main(int argc, char **argv) { int i, thread_id, nloops; #pragma omp parallel private(thread_id, nloops) { nloops = 0; #pragma omp for for (i=0; i<1000; ++i) { ++nloops; } thread_id = omp_get_thread_num(); printf("Thread %d performed %d iterations of the loop.\n", thread_id, nloops ); } return 0; }
The new directive in this code is omp for. This tells the compiler that the for loop directly below this pragma can be run in parallel by the team of threads. In this case, the only work performed in each iteration is increasing the thread private counter of the number of times that the loop has been performed (each thread in the team has its own copy of nloops because it is specified as private as part of the OpenMP parallel pragma). By doing this, each thread in the team counts up the number of times that it has performed the loop.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, loops.
Now return to the previous page.
Copy this into loops.cpp
#include <iostream> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif int main(int argc, const char **argv) { #pragma omp parallel { int nloops = 0; #pragma omp for for (int i=0; i<1000; ++i) { ++nloops; } int thread_id = omp_get_thread_num(); std::cout << "Thread " << thread_id << " performed " << nloops << " iterations of the loop.\n"; } return 0; }
The new directive in this code is omp for. This tells the compiler that the for loop directly below this pragma can be run in parallel by the team of threads. In this case, the only work performed in each iteration is increasing the thread private counter of the number of times that the loop has been performed (each thread in the team has its own copy of nloops because this variable was declared only within the scope of the omp parallel block). By doing this, each thread in the team counts up the number of times that it has performed the loop.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, loops.
Now return to the previous page.
Copy this into loops.F
program main use omp_lib implicit none integer i, thread_id, nloops C$OMP PARALLEL PRIVATE(thread_id, nloops) thread_id = omp_get_thread_num() nloops = 0 C$OMP DO do 100 i=1,1000 nloops = nloops + 1 100 continue print *,"Thread ",thread_id," performed ", . nloops," iterations." C$OMP END PARALLEL end
The new directive in this code is $OMP DO. This tells the compiler that the do loop directly below this pragma can be run in parallel by the team of threads. In this case, the only work performed in each iteration is increasing the thread private counter of the number of times that the loop has been performed (each thread in the team has its own copy of nloops because it is specified as PRIVATE as part of the !OMP PARALLEL directive). By doing this, each thread in the team counts up the number of times that it has performed the loop.
You can compile this program using one of these commands (choose one for the compiler you wish to use);
This will produce the executable, loops.
Now return to the previous page.
Up to this point, the examples have shown how to just run different parts of the program in parallel, either by using sections, and running different sections using different threads, or using loops, and running different iterations of the loop using different threads. While this is good, it isn't yet useful. This is because we have not yet seen how to make the threads work together on a problem. At the moment, each thread works on its own, using its own private variables. For example, we've seen how each thread can count the number of iterations it performed itself in a loop, but no mechanism has been presented yet that would allow the team of threads to count up the total number of iterations performed by all of them.
What we need is a way to allow the threads to combine their thread private copies of variables into a global copy. One method of doing this is to use an OpenMP critical section. A critical section is a part of code that is performed by all of the threads in the team, but that is only performed by one thread at a time. This allows each thread to update a global variable with the local result calculated by that thread, without having to worry about another thread trying to update that global variable at the same time. To make this clear, use the appropriate link below to create the two executables, broken_loopcount and fixed_loopcount;
Try running both executables using different values of OMP_NUM_THREADS. Depending on the compiler, programming language and number of threads, broken_loopcount may print out a wide range of different outputs. Sometimes it will work, and will correctly add the number of iterations onto the global sum, and will correctly print out the intermediate steps without problem. However, sometimes, completely randomly, it will break, and either it will print out nonsense (e.g. it will add 1000 iterations to a total of 4000, but the insist that the total is 10000) or it will get the total number of iterations completely wrong. The reason for this is that while one thread is updating or printing the global total, another thread may be changing it.
The fixed_loopcount in contrast will always work, regardless of compiler, programming language or number of threads. This is because we've protected the update and printing of the global total within an OpenMP critical section. The critical section ensures that only one thread at a time is printing and updating the global sum of loops, and so ensures that two threads don't try to access global_nloops simultaneously.
OpenMP critical sections are extremely important if you want to guarantee that your program will work reproducibly. Without critical sections, random bugs can sneak through, and the result of your program may be different if different numbers of threads are used.
OpenMP loops plus OpenMP critical sections provide the basis for one of the most efficient models of parallel programming, namely map and reduce. The idea is that you map the problem to be solved into a loop over a large number of iterations. Each iteration solves its own part of the problem and computes the result into a local thread-private variable. Once the iteration is complete, all of the thread-private variables are combined together (reduced) via critical sections to form the final global result.
Pi can be calculated using Monte Carlo by imagining a circle with radius 1 sitting at the origin within a square that just contains this circle (so with corners [-1,-1], [-1,1], [1,-1] and [1,1]). The area of the circle is pi, (from pi r squared), while the area of the square is 4. If we imagine throwing darts randomly at the square, than the proportion that lie within the circle compared to the proportion that lie outside the circle will be directly related to the ratio of the area of the circle against the area of the square. In a parallel loop, you must thus generate a large number of random points in the square, and count up the number that lie within the circle and those that lie outside. Reduce these numbers into a global count of the number inside and outside the circle, and then finally take the ratio of these numbers to get the value of pi. This algorithm to calculate pi is described in more detail here.
Note that you will need to use the following random number functions and square root functions;
The Fortran rand function already generates a random number between 0 and 1. To achieve this in C or C++ you need to write the following function;
double rand_one() { return rand() / (RAND_MAX + 1.0); }
To get a random number from -1 to 1 you need to use;
(2 * rand()) - 1 or (2 * rand_one()) - 1
Here are the possible answers - take a look if you get stuck or you want to check your work;
Copy this code into broken_loopcount.c
#include <stdio.h> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif int main(int argc, char **argv) { int i, thread_id; int global_nloops, private_nloops; global_nloops = 0; #pragma omp parallel private(private_nloops, thread_id) { private_nloops = 0; thread_id = omp_get_thread_num(); #pragma omp for for (i=0; i<100000; ++i) { ++private_nloops; } printf("Thread %d adding its iterations (%d) to the sum (%d)...\n", thread_id, private_nloops, global_nloops); global_nloops += private_nloops; printf("...total nloops now equals %d.\n", global_nloops); } printf("The total number of loop iterations is %d\n", global_nloops); return 0; }
and copy this code into fixed_loopcount.c
#include <stdio.h> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif int main(int argc, char **argv) { int i, thread_id; int global_nloops, private_nloops; global_nloops = 0; #pragma omp parallel private(private_nloops, thread_id) { private_nloops = 0; thread_id = omp_get_thread_num(); #pragma omp for for (i=0; i<100000; ++i) { ++private_nloops; } #pragma omp critical { printf("Thread %d adding its iterations (%d) to the sum (%d)...\n", thread_id, private_nloops, global_nloops); global_nloops += private_nloops; printf("...total nloops now equals %d.\n", global_nloops); } } printf("The total number of loop iterations is %d\n", global_nloops); return 0; }
The only new code here is the omp critical section in fixed_loopcount.c. The critical section is performed by each thread, but can only be performed by one thread at a time. This ensures that while one thread is updating the global nloops variable (global_nloops) with the thread local value of nloops (private_nloops), that the value of global_nloops is not changed by any other thread.
You can compile these programs using one of these sets of commands (choose one for the compiler you wish to use);
This will produce the executables, broken_loopcount and fixed_loopcount.
Now return to the previous page.
Copy this code into broken_loopcount.cpp
#include <iostream> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif int main(int argc, char **argv) { int global_nloops = 0; #pragma omp parallel { int private_nloops = 0; int thread_id = omp_get_thread_num(); #pragma omp for for (int i=0; i<100000; ++i) { ++private_nloops; } std::cout << "Thread " << thread_id << " adding its iterations (" << private_nloops << ") to the sum (" << global_nloops << ")...\n"; global_nloops += private_nloops; std::cout << "...total nloops now equals " << global_nloops << ".\n"; } std::cout << "The total number of loop iterations is " << global_nloops << ".\n"; return 0; }
and copy this code into fixed_loopcount.cpp
#include <iostream> #ifdef _OPENMP #include <omp.h> #else #define omp_get_thread_num() 0 #endif int main(int argc, char **argv) { int global_nloops = 0; #pragma omp parallel { int private_nloops = 0; int thread_id = omp_get_thread_num(); #pragma omp for for (int i=0; i<100000; ++i) { ++private_nloops; } #pragma omp critical { std::cout << "Thread " << thread_id << " adding its iterations (" << private_nloops << ") to the sum (" << global_nloops << ")...\n"; global_nloops += private_nloops; std::cout << "...total nloops now equals " << global_nloops << ".\n"; } } std::cout << "The total number of loop iterations is " << global_nloops << ".\n"; return 0; }
The only new code here is the omp critical section in fixed_loopcount.cpp. The critical section is performed by each thread, but can only be performed by one thread at a time. This ensures that while one thread is updating the global nloops variable (global_nloops) with the thread local value of nloops (private_nloops), that the value of global_nloops is not changed by any other thread.
You can compile these programs using one of these sets of commands (choose one for the compiler you wish to use);
This will produce the executables, broken_loopcount and fixed_loopcount.
Now return to the previous page.
Copy this code into broken_loopcount.F
program main use omp_lib implicit none integer i, thread_id integer private_nloops, global_nloops global_nloops = 0 C$OMP PARALLEL PRIVATE(thread_id, private_nloops) thread_id = omp_get_thread_num() private_nloops = 0 C$OMP DO do 100 i=1,100000 private_nloops = private_nloops + 1 100 continue print *,"Thread ",thread_id," adding its iterations (", . private_nloops,") to the sum (",global_nloops, . ")..." global_nloops = global_nloops + private_nloops print *,"...total nloops now equals ",global_nloops C$OMP END PARALLEL print *,"The total number of loop iterations is ", . global_nloops end
and copy this code into fixed_loopcount.F
program main use omp_lib implicit none integer i, thread_id integer private_nloops, global_nloops global_nloops = 0 C$OMP PARALLEL PRIVATE(thread_id, private_nloops) thread_id = omp_get_thread_num() private_nloops = 0 C$OMP DO do 100 i=1,100000 private_nloops = private_nloops + 1 100 continue C$OMP CRITICAL print *,"Thread ",thread_id," adding its iterations (", . private_nloops,") to the sum (",global_nloops, . ")..." global_nloops = global_nloops + private_nloops print *,"...total nloops now equals ",global_nloops C$OMP END CRITICAL C$OMP END PARALLEL print *,"The total number of loop iterations is ", . global_nloops end
The only new code here is the $OMP CRITICAL to $OMP END CRITICAL section in fixed_loopcount.c. The critical section is performed by each thread, but can only be performed by one thread at a time. This ensures that while one thread is updating the global nloops variable (global_nloops) with the thread local value of nloops (private_nloops), that the value of global_nloops is not changed by any other thread.
You can compile these programs using one of these sets of commands (choose one for the compiler you wish to use);
This will produce the executables, broken_loopcount and fixed_loopcount.
Now return to the previous page.
#include <math.h> #include <stdlib.h> #include <stdio.h> double rand_one() { return rand() / (RAND_MAX + 1.0); } int main(int argc, char **argv) { int n_inside, n_outside; int pvt_n_inside, pvt_n_outside; int i; double x, y, r, pi; n_inside = 0; n_outside = 0; #pragma omp parallel private(x, y, r, pvt_n_inside, pvt_n_outside) { pvt_n_inside = 0; pvt_n_outside = 0; #pragma omp for for (i=0; i<1000000; ++i) { x = (2*rand_one()) - 1; y = (2*rand_one()) - 1; r = sqrt( x*x + y*y ); if (r < 1.0) { ++pvt_n_inside; } else { ++pvt_n_outside; } } #pragma omp critical { n_inside += pvt_n_inside; n_outside += pvt_n_outside; } } pi = (4.0 * n_inside) / (n_inside + n_outside); printf("The estimated value of pi is %f\n", pi); return 0; }
#include <cmath> #include <cstdlib> #include <iostream> double rand_one() { return std::rand() / (RAND_MAX + 1.0); } int main(int argc, const char **argv) { int n_inside = 0; int n_outside = 0; #pragma omp parallel { int pvt_n_inside = 0; int pvt_n_outside = 0; #pragma omp for for (int i=0; i<1000000; ++i) { double x = (2*rand_one()) - 1; double y = (2*rand_one()) - 1; double r = sqrt( x*x + y*y ); if (r < 1.0) { ++pvt_n_inside; } else { ++pvt_n_outside; } } #pragma omp critical { n_inside += pvt_n_inside; n_outside += pvt_n_outside; } } double pi = (4.0 * n_inside) / (n_inside + n_outside); std::cout << "The estimated value of pi is " << pi << std::endl; return 0; }
program main implicit none integer n_inside, n_outside integer pvt_n_inside, pvt_n_outside integer i double precision x,y,r,pi n_inside = 0 n_outside = 0 C$OMP PARALLEL pvt_n_inside = 0 pvt_n_outside = 0 C$OMP DO do 100 i=1,1000000 x = (2 * rand()) - 1 y = (2 * rand()) - 1 r = sqrt(x**2 + y**2) if (r < 1) then pvt_n_inside = pvt_n_inside + 1 else pvt_n_outside = pvt_n_outside + 1 endif 100 continue C$OMP CRITICAL n_inside = n_inside + pvt_n_inside n_outside = n_outside + pvt_n_outside C$OMP END CRITICAL C$OMP END PARALLEL pi = 4 * dble(n_inside) / dble(n_inside + n_outside) print *,"The estimated value of pi is ",pi end
Reduction, which is the process of combining (or reducing) the results of several sub-calculations into a single combined (reduced) result, is very common, and is the second half of the very powerful map-reduce form of parallel programming. In the exercise in the last section you used reduction to form the global sum of the total number of points inside and outside the circle, calculated as the sum of the number of points inside and outside the circle calculated by each thread's iterations of the loop. While it may appear easy to write your own reduction code, it is actually very hard to write efficient reduction code. This is because reduction requires the use of a critical section, where only one thread is allowed to update the global sum at a time. Reduction can actually be implemented much more efficiently, e.g. perhaps by dividing threads into pairs, and getting each pair to sum their results, and then dividing into pairs of pairs, and summing the pairs of pairs results, etc. (this method is known as binary tree reduction - see here for a more in-depth discussion of reduction algorithms).
So reduction is actually quite complicated to implement if you want it to be efficient and work well. Fortunately, you don't have to implement it, as OpenMP provides a reduction directive which has implemented it for you! The reduction directive is added to the end of the OpenMP parallel directive and has the following form;
reduction( operator : variable list )
where operator can be any binary operator (e.g. +, -, *), and variable list is a single variable or list of variables that will be used for the reduction, e.g.
reduction( + : sum )
will tell the compiler that sum will hold the global result of a reduction which will be formed by adding together the results of thread-private calculations, while,
reduction( - : balance, diff )
will tell the compiler that both balance and diff will hold the global results of reductions which are formed by subtracting the results of thread-private calculations.
To make this clear, the following links provide the code for the fixed loop counting examples from the last section which use reduction rather than thread-private variables with an OpenMP critical section;
Here's the answers for checking (no peeking before you've finished!)
#include <stdio.h> int main(int argc, char **argv) { int i; int private_nloops, nloops; nloops = 0; #pragma omp parallel private(private_nloops) \ reduction(+ : nloops) { private_nloops = 0; #pragma omp for for (i=0; i<100000; ++i) { ++private_nloops; } /* Reduction step - reduce 'private_nloops' into 'nloops' */ nloops = nloops + private_nloops; } printf("The total number of loop iterations is %d\n", nloops); return 0; }
Return to the previous page.
#include <iostream> int main(int argc, char **argv) { int nloops = 0; #pragma omp parallel reduction( + : nloops ) { int private_nloops = 0; #pragma omp for for (int i=0; i<100000; ++i) { ++private_nloops; } nloops = nloops + private_nloops; } std::cout << "The total number of loop iterations is " << nloops << ".\n"; return 0; }
Return to the previous page.
program main implicit none integer i integer private_nloops, nloops nloops = 0 C$OMP PARALLEL PRIVATE(private_nloops) C$OMP& REDUCTION( + : nloops ) private_nloops = 0 C$OMP DO do 100 i=1,100000 private_nloops = private_nloops + 1 100 continue nloops = nloops + private_nloops C$OMP END PARALLEL print *,"The total number of loop iterations is ", . nloops end
Return to the previous page.
#include <math.h> #include <stdlib.h> #include <stdio.h> double rand_one() { return rand() / (RAND_MAX + 1.0); } int main(int argc, char **argv) { int n_inside, n_outside; int pvt_n_inside, pvt_n_outside; int i; double x, y, r, pi; n_inside = 0; n_outside = 0; #pragma omp parallel private(x, y, r, pvt_n_inside, pvt_n_outside) \ reduction( + : n_inside, n_outside ) { pvt_n_inside = 0; pvt_n_outside = 0; #pragma omp for for (i=0; i<1000000; ++i) { x = (2*rand_one()) - 1; y = (2*rand_one()) - 1; r = sqrt( x*x + y*y ); if (r < 1.0) { ++pvt_n_inside; } else { ++pvt_n_outside; } } n_inside = n_inside + pvt_n_inside; n_outside = n_outside + pvt_n_outside; } pi = (4.0 * n_inside) / (n_inside + n_outside); printf("The estimated value of pi is %f\n", pi); return 0; }
#include <cmath> #include <cstdlib> #include <iostream> double rand_one() { return std::rand() / (RAND_MAX + 1.0); } int main(int argc, const char **argv) { int n_inside = 0; int n_outside = 0; #pragma omp parallel reduction(+ : n_inside, n_outside) { int pvt_n_inside = 0; int pvt_n_outside = 0; #pragma omp for for (int i=0; i<1000000; ++i) { double x = (2*rand_one()) - 1; double y = (2*rand_one()) - 1; double r = sqrt( x*x + y*y ); if (r < 1.0) { ++pvt_n_inside; } else { ++pvt_n_outside; } } n_inside += pvt_n_inside; n_outside += pvt_n_outside; } double pi = (4.0 * n_inside) / (n_inside + n_outside); std::cout << "The estimated value of pi is " << pi << std::endl; return 0; }
program main implicit none integer n_inside, n_outside integer pvt_n_inside, pvt_n_outside integer i double precision x,y,r,pi n_inside = 0 n_outside = 0 C$OMP PARALLEL REDUCTION( + : n_inside, n_outside ) pvt_n_inside = 0 pvt_n_outside = 0 C$OMP DO do 100 i=1,1000000 x = (2 * rand()) - 1 y = (2 * rand()) - 1 r = sqrt(x**2 + y**2) if (r < 1) then pvt_n_inside = pvt_n_inside + 1 else pvt_n_outside = pvt_n_outside + 1 endif 100 continue n_inside = n_inside + pvt_n_inside n_outside = n_outside + pvt_n_outside C$OMP END PARALLEL pi = 4 * dble(n_inside) / dble(n_inside + n_outside) print *,"The estimated value of pi is ",pi end
We have now covered enough that we can use OpenMP to parallelise a map/reduce style calculation. In this case, the problem we will solve will be calculating the total interaction energy between each ion in an array of ions with a single reference ion. Passed into this function will be the reference ion, and an array of ions. The algorithm performed for each ion in the array will be;
Map/reduce can be used when you have an array of data, a function you wish to apply (to map) to each item in the array, and a single value you want back that is the reduction of the results of applying the function to each item in the array. In terms of map/reduce, our algorithm would look like this;
Here are incomplete pieces of code that implement this algorithm (note that this is to provide an example of how map/reduce can be used - you don't need to complete this code);
Note that the amount of OpenMP in these examples is very low (just 2-3 lines). This is quite common for OpenMP programs - most of the work of parallelisation is organising your code so that it can be parallelised. Once it has been organised, you then only need to add a small number of OpenMP directives.
Compile the above programs, and try running them using different numbers of threads.
#include <stdio.h> #include <math.h> #include <stdlib.h> /* Define an Ion type to hold the coordinates of an Ion */ typedef struct Ion { double x; double y; double z; } Ion; /* The reference Ion */ struct Ion reference_ion; /* Return the square of a number */ double square(double x) { return x*x; } /* Energy function to be mapped */ double calc_energy( struct Ion ion ) { double r; r = sqrt( square( reference_ion.x - ion.x ) + square( reference_ion.y - ion.y ) + square( reference_ion.z - ion.z ) ); /* The energy is simply 1 / r */ return 1.0 / r; } /* You will need to fill in this function to read in an array of ions and return the array */ struct Ion* readArrayOfIons(int *num_ions) { int i; Ion *ions = (Ion*)malloc(10 * sizeof(Ion)); *num_ions = 10; for (i=0; i<10; ++i) { ions[i].x = 0.0; ions[i].y = 0.0; ions[i].z = i; } return ions; } int main(int argc, char **argv) { int i, num_ions; struct Ion *ions_array; double total_energy, mapped_energy; /* Lets put the reference ion at (1,2,3) */ reference_ion.x = 1.0; reference_ion.y = 2.0; reference_ion.z = 3.0; /* Now read in an array of ions */ ions_array = readArrayOfIons( &num_ions ); total_energy = 0.0; #pragma omp parallel private(mapped_energy) \ reduction( + : total_energy ) { mapped_energy = 0.0; #pragma omp for for (i=0; i < num_ions; ++i) { /* Map this ion against the function */ mapped_energy += calc_energy( ions_array[i] ); } /* Reduce to get the result */ total_energy += mapped_energy; } printf("The total energy is %f\n", total_energy); return 0; }
#include <cmath> #include <vector> #include <iostream> // Define an ion type to hold the // coordinates of the ion struct Ion { Ion() { x = 0; y = 0; z = 0; } Ion(double _x, double _y, double _z) { x = _x; y = _y; z = _z; } double x; double y; double z; }; // The reference ion Ion reference_ion; // Return the square of a number inline double square(const double x) { return x*x; } // Energy function to be mapped double calc_energy(const Ion &ion) { double r = std::sqrt( square( reference_ion.x - ion.x ) + square( reference_ion.y - ion.y ) + square( reference_ion.z - ion.z ) ); // The energy is simply 1 / r return 1.0 / r; } // You would need to fill in this function to read // in an array of ions and return the array std::vector<Ion> readArrayOfIons() { std::vector<Ion> ions; for (int i=0; i<10; ++i) { ions.push_back( Ion(0,0,i) ); } return ions; } int main(int argc, const char **argv) { //put the reference ion at (1,2,3) reference_ion.x = 1.0; reference_ion.y = 2.0; reference_ion.z = 3.0; //read in the array of ions std::vector<Ion> ions_array = readArrayOfIons(); double total_energy = 0.0; #pragma omp parallel reduction( + : total_energy ) { double mapped_energy = 0.0; const int n_ions = ions_array.size(); #pragma omp for for (int i=0; i < n_ions; ++i) { //map this ion against the function mapped_energy += calc_energy( ions_array[i] ); } //reduce to get the result total_energy += mapped_energy; } std::cout << "The total energy is " << total_energy << std::endl; return 0; }
c Here is the function used to calculate the energy c of the passed ion double precision function calc_energy(ref_x, ref_y, ref_z, . ion_x, ion_y, ion_z) implicit none double precision ion_x, ion_y, ion_z double precision ref_x, ref_y, ref_z double precision r r = sqrt( (ref_x - ion_x)**2 + . (ref_y - ion_y)**2 + . (ref_z - ion_z)**2 ) c The energy is just 1 / r calc_energy = 1.0 / r return end c You would need to fill in this subroutine to read c in the array of ions subroutine readArrayOfIons(ions_array, n_ions) implicit none integer i, n_ions integer MAX_IONS parameter(MAX_IONS = 1000) double precision ions_array(MAX_IONS,3) n_ions = 10 do i=1,10 ions_array(i,1) = 0.0 ions_array(i,2) = 0.0 ions_array(i,3) = i-1 enddo end program main implicit none integer i, n_ions integer MAX_IONS parameter(MAX_IONS = 1000) double precision ions_array(MAX_IONS, 3) double precision total_energy, mapped_energy double precision calc_energy double precision ref_ion(3) ref_ion(1) = 1 ref_ion(2) = 2 ref_ion(3) = 3 call readArrayOfIons(ions_array, n_ions) total_energy = 0 C$OMP PARALLEL PRIVATE(mapped_energy) REDUCTION( + : total_energy) mapped_energy = 0 C$OMP DO do 100 i=1,n_ions C map this ion against the function mapped_energy = mapped_energy + calc_energy(ref_ion(1), . ref_ion(2), . ref_ion(3), . ions_array(i, 1), . ions_array(i, 2), . ions_array(i, 3)) 100 continue C Reduce to get the result total_energy = total_energy + mapped_energy C$OMP END PARALLEL print *,"The total energy is ",total_energy end
OpenMP is reasonably straightforward to use, and can be added to existing programs. However, don't let this simplicity deceive you. While writing OpenMP code is straightforward, writing efficient, scalable OpenMP code can be hard, and requires you to think deeply about the problem to be solved. As you saw in the map/reduce example, the amount of OpenMP was quite small, but the application itself had to be arranged in a way that allowed the problem to be solved using a map/reduce approach.
While the techniques used to get good performance using OpenMP are problem specific, there are a set of general guidelines that will put you on the right track;
However, the most important advice is benchmark, benchmark, benchmark! Don't try and guess the performance of your code, or guess that something makes it faster. Actually time how fast your code runs (e.g. using the time command or using timers within your code), and compare timings for different numbers of threads (by changing OMP_NUM_THREADS) every time you make changes to your code. Make benchmarking and performance testing as important in your coding workflow as debugging and unit testing.
The paper Multicore Parallelization of Kohn-Sham Theory presents an OpenMP parallelisation of a Density Functional Theory (DFT) calculation. The calculation is performed within the quantum chemical program Molpro. Molpro is a large (1.5 million lines of code) program written in a variety of versions of Fortran. Benchmarking revealed that the bottleneck of the calculation was the numerical integration of integrals using a quadrature grid. We thus decided to use OpenMP to parallelise this calculation by using an OpenMP loop to map batches of grid points for evaluation to different threads, and to then use OpenMP reduction to sum together the sum of the integral evaluations at each grid point.
To achieve this, we had to restructure the way that Molpro represented the quadrature grid and basis functions for the integrals. We decided that we would write the code in C++ as this would let us have greater control of which data was thread-local and which was global, and it allowed us to create objects that represented different bits of data in the calculation (e.g. Grid, Orbital, Matrix etc. - if you compare the Fortran, C and C++ examples in this workshop you will see that the OpenMP directives for C++ are much easier as the private directive is not needed). Rather than rewrite the whole of Molpro in C++ (which would have taken years!) we just rewrote the function that performed the numerical integration in C++, and then linked this new C++ function into the Fortran program. This works, as C++ and Fortran are both compiled to machine code, and there are now standard methods for calling Fortran functions from C++, and for calling C++ functions from Fortran.
Rewriting the code in this way allowed us to speed up the DFT calculation in Molpro by over 10 times on a 16 core Intel Xeon (a quad-socket quad-core node). The rewrite itself took a few months, the majority of which were spent thinking about how to restructure the algorithm so that it could be easily parallelised using OpenMP.
The above graphs show the scaling behaviour of the resulting code as a function of the number of OpenMP threads. Three calculations were performed on each of the four different benchmark platforms, with each calculation using a sequentially larger data size. This allowed us to benchmark how well the code scaled as we increased the number of cores, and also how the code behaved as the amount of data to process became larger than the bandwidth between main memory and the processor (the solid line is the smallest problem, the dashed line is the medium problem and the dotted line is the large problem). This kind of benchmarking is invaluable when writing parallel code, and is also useful for machine evaluation, e.g. in this case we can see that the Intel Xeon processors have more bandwidth so cope better than the Opterons as the problem size increases. It should be noted that the Opterons tested here were much older than the Xeons, and that modern Opterons are likely to be comparable to modern Xeons.
This course was designed to give you a quick taster of OpenMP. If you enjoyed it, and want to learn more, then I strongly encourage you to take a look at a book on OpenMP, e.g. Using OpenMP: Portable Shared Memory Parallel Programming, or to check out other OpenMP websites or performing a web search for openmp tutorial or openmp for beginners.
The best way to learn any programming language or tool is to read other people's code and copy it. Please feel free to copy, adapt and play with the examples in this workshop. They should hopefully provide the starting points for basic parallelisation of your code using OpenMP.