First 5 minutes of hell
OpenMP is a high-level API for shared-memory parallel programming (it runs within each node; communication between nodes is handled by MPI). It consists of parametrized compiler directives, environment variables, and runtime library routines that are used to parallelize the program run.
OpenMP utilizes the fork-join programming model: the program starts with one initial thread that runs sequentially (other threads are initiated and waiting in the thread pool). The program run then consists of multiple parallel regions, where specified teams of threads are forked from the main thread, run the computations and then are joined back at the end (by an implicit barrier).
- if any thread fails prematurely, all threads are terminated and the program fails
- the programmer is fully in control and is responsible for handling all parallelism challenges
OpenMP supports the loop (data) and functional (task) parallelism models, and it uses a relaxed-consistency memory model: threads can keep local values in their cache without writing them through to main memory immediately; when needed, we can force this explicitly with the `flush` directive.
A parallel region is defined by `#pragma omp parallel [clauses]`, which wakes up the number of threads given by `num_threads(n)`, runs the parallel block of code, and then puts the threads back to sleep in the thread pool.
- the number of threads is fixed during the parallel region
- the regions could be nested
Properties of variables (each variable gets an OpenMP property):
- shared: are shared (visible) among threads
- private: private, uninitialized variables for each thread
- firstprivate: private, but initialized to the master’s value before going into parallel region
- default: set a default for all unspecified variables
- reduction: the variable is private to each thread, each thread populates it separately and then at the end of the parallel region, all results are reduced into one variable using the specified reduction operator (+, -, *, etc.)
- threadprivate: defines “private” on the global level; each thread has its own copy, which persists across parallel regions
- a property given to a pointer applies to the pointer itself, not to the object it points to
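The pointer caveat and the basic properties can be seen together in a minimal sketch. The function name `fill_squares` and its parameters are hypothetical, and the combined `parallel for` construct is used only to keep the example short:

```c
/* Hypothetical helper: writes offset + i*i into out[i] for i in [0, n). */
void fill_squares(int *out, int n, int offset) {
    int scratch = -1;   /* outer variable; each thread gets its own copy */

    /* default(none): every variable used in the region must be listed.
       firstprivate(out): each thread gets a private copy of the POINTER,
       initialized from the master's value, so all copies still refer to
       the same array - the pointed-to data is effectively shared. */
    #pragma omp parallel for default(none) firstprivate(out, n, offset) private(scratch)
    for (int i = 0; i < n; i++) {
        scratch = i * i;             /* private copy starts uninitialized */
        out[i] = offset + scratch;   /* distinct elements, so no data race */
    }
}
```

Because every iteration writes a different element, no synchronization is needed here; if two threads wrote the same element through their pointer copies, the programmer would have to protect that access.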
POSIX threads vs. OpenMP - why does OpenMP exist?
- the standard for multithreaded programming: POSIX threads
- low-level, requires expertise, complicated to learn/debug
- originally designed for kernel operations
- OpenMP library
- a high-level API for multithreaded programming on a shared memory
- it runs within every node in the parallel computer
- internode communication is handled by MPI
- node = multicore shared memory system
- it is portable across platforms and standardized (+industry support)
- do not combine it with other multithreading libraries; they may interfere with each other
- consists of
- parametrized directives for the compiler
- environment variables
- library operations of the runtime environment
- programmer’s responsibility:
- check data dependencies, data races, data access collisions, or deadlocks
- ensure the efficient utilization of the shared memory
The fork-join programming model
- the run of the application is divided into regions that could be parallelized
- the program starts with a single initial thread that executes sequentially
- at each parallel region, a “team of threads” is forked from the initial thread and then joined back together at the end of the parallel region (there is an implicit barrier)
- these parallel regions can be nested
- OpenMP creates a thread pool with threads waiting to join the computations (to save time on creating/deleting threads on every fork-join)
- after the synchronization, the threads are returned to the pool and wait for the next fork
- almost everything is the programmer’s responsibility (data types, race conditions, access collisions etc.)
- no optimizations included
- avoid thread-unsafe operations in the parallel regions
master ──●──[fork]──┬──thread 0──┬──●──[fork]──┬──thread 0──┬──●── master
                    ├──thread 1──┤             ├──thread 1──┤
                    └──thread 2──┘             ├──thread 2──┤
                                               └──thread 3──┘
  sequential           parallel    sequential     parallel    sequential
If during a parallel region any thread is prematurely terminated, then all threads of all teams are terminated and the entire OpenMP program is killed.
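The diagram above can be mirrored by a small sketch with a 3-thread and a 4-thread region. The names `run_region` and `fork_join_demo` are hypothetical, and the serial stubs are only there so that the code also builds when OpenMP is not enabled:

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* serial stubs so the sketch also builds without OpenMP support */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Forks a team of up to `requested` threads and returns the team size
   that the runtime actually granted. */
int run_region(int requested) {
    int team_size = 0;
    #pragma omp parallel num_threads(requested) shared(team_size)
    {
        if (omp_get_thread_num() == 0)          /* master of this team */
            team_size = omp_get_num_threads();
        printf("  thread %d working\n", omp_get_thread_num());
    }   /* implicit barrier: the team is joined here */
    return team_size;
}

/* sequential -> 3-thread region -> sequential -> 4-thread region -> sequential */
void fork_join_demo(void) {
    printf("sequential part (initial thread only)\n");
    int first = run_region(3);
    printf("sequential part again, first team had %d thread(s)\n", first);
    int second = run_region(4);
    printf("sequential part again, second team had %d thread(s)\n", second);
}
```

Calling `fork_join_demo()` from `main` and compiling with OpenMP enabled (e.g. `gcc -fopenmp`) prints the thread lines of each region in an arbitrary order, since the threads of a team run concurrently.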
OpenMP parallel regions
- syntax for creating the parallel region:
#include <omp.h>
#pragma omp parallel [clause[ [,] clause] ...]
{ structured block of parallel region }
- when a thread enters a `parallel` region, it becomes the master thread (rank 0) of the new team of threads
- main clauses:
  - `if(condition)` - if the condition is satisfied, a new team of threads is created
    - if false: no team is created and the code runs sequentially
  - `num_threads(expr)` - sets the number of threads in the team
- variables have their modes:
  - `shared(list)` - variables in the list are visible to all threads (cannot be arrays or structures)
    - the programmer is fully responsible for the synchronization
    - ! warning: if the variable is a pointer, its properties apply only to the pointer, not to the data it points to
  - `private(list)` - each thread has its own uninitialized copy of the variable, which is destroyed at the region’s end
    - after the region completes, the original variable has the value it had before the parallel block
  - `firstprivate(list)` - like private variables, but initialized to the value the master had before entering the parallel region
  - `default(shared|none)` - sets the default mode for all unspecified variables
    - `none`: the programmer must specify the mode of every variable explicitly
- if any thread in the parallel region fails during computation, the runtime terminates all threads in all teams and the program fails
- the number of threads is fixed during the whole region
- it is not allowed to jump out of a parallel region (or jump in from outside)
- at the end, there is always an implicit barrier to synchronize the threads before continuing
- nested regions:
- one of the threads can be the master of its own team of sub-threads
- the depth of this recursion is bounded by the hardware limitations
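Nested regions can be sketched as follows. The function name `nested_region_executions` is hypothetical; the sketch assumes that two active levels are permitted, which it requests through `omp_set_max_active_levels`:

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Forks a 2-thread outer team; every outer thread then forks its own
   2-thread inner team. Returns how many times the innermost block ran. */
int nested_region_executions(void) {
    int executions = 0;
#ifdef _OPENMP
    omp_set_max_active_levels(2);   /* allow two levels of active parallelism */
#endif
    #pragma omp parallel num_threads(2) shared(executions)
    {
        /* the encountering thread becomes the master (rank 0) of the inner team */
        #pragma omp parallel num_threads(2) shared(executions)
        {
            #pragma omp atomic
            executions++;
        }   /* inner team joins here */
    }       /* outer team joins here */
    return executions;
}
```

With nesting available the result is typically 4 (2 outer x 2 inner threads); the runtime may grant fewer threads, so the value should be treated as an upper bound rather than a guarantee.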
Reduction property
- reduction is really important and handy
- `reduction(operator:variable)` - each thread has its own copy (the mode is private), initialized to the identity element of the operator (0 for sum, 1 for multiplication), and accumulates into it
- at the end, all copies of that variable are reduced together using the reduction operator (+, -, * etc.)
- the + operator is not truly associative for floats (so the result may differ depending on the order, but it is generally acceptable)
- there are two reduction implementations:
- linear (sequential): the results from the threads are reduced one by one (using the reduction operator) - this is used by OpenMP
- logarithmic: it has fewer sequential steps, but a synchronization barrier is required in every parallel step, which causes huge overhead → linear is used in OpenMP (as the number of threads p is usually much smaller than the problem size n, the linear time is negligible)
- cannot be combined with task directive
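A minimal sketch of the reduction property, using the combined `parallel for` construct. The function names `sum_to` and `factorial` are hypothetical; integer types are used so that the result does not depend on the order of the reduction:

```c
/* Sums 1..n with a + reduction: every thread accumulates into its own
   private copy (initialized to 0, the identity of +), and the copies are
   combined into `sum` at the end of the region. */
long sum_to(int n) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += i;
    return sum;
}

/* Same pattern with * (identity 1): computes n! for small n. */
long factorial(int n) {
    long prod = 1;
    #pragma omp parallel for reduction(*:prod)
    for (int i = 2; i <= n; i++)
        prod *= i;
    return prod;
}
```

With `double` instead of `long`, the same code may return slightly different sums from run to run, because the order in which the per-thread copies are combined is not fixed.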
Threadprivate property
- for implementing a counter or other accumulative variable across multiple parallel regions
- something like the `private(list)` property, but on the global scope: the values are not destroyed at the end of the parallel region, but persist through multiple parallel regions
  - all following parallel regions must have the same number of threads
  - the property definition must precede all parallel regions
- as with the private(list) property, the per-thread copies in the first parallel region do not hold the master’s current value
  - on the local level, this is handled by firstprivate()
  - on the threadprivate level (global level), this is handled by the `copyin` clause (which copies the master thread’s value into each thread’s copy)
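A counter that survives across regions could look like the sketch below. The names `calls`, `TEAM` and `count_across_regions` are hypothetical; dynamic adjustment is switched off and both regions request the same team size, which are the conditions assumed for persistence:

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial stubs so the sketch also builds without OpenMP support */
static int  omp_get_thread_num(void) { return 0; }
static void omp_set_dynamic(int on)  { (void)on; }
#endif

#define TEAM 2

static int calls = 0;                 /* one persistent copy per thread */
#pragma omp threadprivate(calls)

/* Runs two regions with the same team size and returns the counter of
   thread 0 after the second region. */
int count_across_regions(int start) {
    int seen_by_master = 0;
    omp_set_dynamic(0);               /* keep the team size fixed */
    calls = start;                    /* sets the master's copy only */

    /* copyin broadcasts the master's value to every thread's copy */
    #pragma omp parallel num_threads(TEAM) copyin(calls)
    {
        calls++;
    }

    /* no copyin: each thread continues from the value it left behind */
    #pragma omp parallel num_threads(TEAM) shared(seen_by_master)
    {
        calls++;
        if (omp_get_thread_num() == 0)
            seen_by_master = calls;
    }
    return seen_by_master;
}
```

Starting from `start`, thread 0 increments its copy once in each region, so the function returns `start + 2`.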
OpenMP memory model
OpenMP supports a relaxed consistency memory model. Threads can keep local copies of shared variables in cache memory and are not forced to write-through every local update immediately into shared memory.
If the programmer needs to ensure that all threads see identical shared memory contents, they must explicitly force a synchronization - for example, using the `flush` directive (`#pragma omp flush`) to flush local values of shared variables into shared memory. Certain directives (barriers, critical regions, atomic operations) include implicit flushes.
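In practice, the implicit flushes are often enough. The sketch below (with the hypothetical function `count_even`) updates a shared counter only inside a critical region, so each thread sees the latest value on entry and publishes its update on exit, and the barrier at the end of the region makes the final value visible to the initial thread:

```c
/* Counts the even numbers in [0, n) using a shared counter. */
int count_even(int n) {
    int hits = 0;
    #pragma omp parallel for shared(hits)
    for (int i = 0; i < n; i++) {
        if (i % 2 == 0) {
            /* critical: one thread at a time, with an implicit flush
               on entry and on exit */
            #pragma omp critical
            {
                hits++;
            }
        }
    }   /* implicit barrier (and flush) at the end of the region */
    return hits;
}
```

Without the critical region, the increments would race and the result could be too small. For a plain sum like this one, a `reduction(+:hits)` clause would usually be the cheaper choice.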
Determining the number of threads in a parallel region
- threads are numbered from 0 (master) to p - 1
- the number of threads is determined by the first rule in this order:
  - The `if(condition)` clause is evaluated. If false → p = 1 (sequential execution).
  - The value of `num_threads(expr)`.
  - The value of the last call to `omp_set_num_threads(expr)`.
  - The environment variable `OMP_NUM_THREADS`.
If none of these are specified, the value is implementation-dependent (typically the number of CPU cores).
- number of threads could also be controlled through ICV (internal control variables) - team size, nesting depth etc.
- accessible only through the OpenMP API calls
- e.g. dyn-var (boolean) controls whether dynamic adjustment of the number of threads in parallel regions is enabled
- if dyn-var is enabled, the system may give fewer threads than requested
- if dyn-var is disabled and more threads than are available are requested, the behavior is implementation-defined
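The priority order can be checked with a small sketch. The helper names `configure_threads`, `team_size` and `default_team_size` are hypothetical; the serial stubs only keep the code buildable without OpenMP:

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial stubs so the sketch also builds without OpenMP support */
static int  omp_get_thread_num(void)   { return 0; }
static int  omp_get_num_threads(void)  { return 1; }
static void omp_set_num_threads(int n) { (void)n; }
static void omp_set_dynamic(int on)    { (void)on; }
#endif

/* Disables dyn-var and sets the default team size for later regions. */
void configure_threads(int n) {
    omp_set_dynamic(0);
    omp_set_num_threads(n);
}

/* Team size of a region with explicit if() and num_threads() clauses. */
int team_size(int parallel_ok, int requested) {
    int p = 0;
    #pragma omp parallel if(parallel_ok) num_threads(requested) shared(p)
    {
        if (omp_get_thread_num() == 0)
            p = omp_get_num_threads();
    }
    return p;
}

/* Team size when no clause is given, so the value set by
   omp_set_num_threads() (or OMP_NUM_THREADS) applies. */
int default_team_size(void) {
    int p = 0;
    #pragma omp parallel shared(p)
    {
        if (omp_get_thread_num() == 0)
            p = omp_get_num_threads();
    }
    return p;
}
```

After `configure_threads(3)`, `team_size(0, 4)` returns 1 because the false `if` condition wins, `team_size(1, 2)` returns at most 2 because `num_threads` overrides the library call, and `default_team_size()` returns at most 3.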
Potential exam questions
- Describe the OpenMP fork-join programming model. What happens at the beginning and end of a parallel region? What is the role of the initial thread, the master thread, and the thread pool?
- List the three main components of the OpenMP API. Why should OpenMP directives not be mixed with other threading libraries?
- What are the motivations for OpenMP? What are its limitations (name at least four)?
- Explain the general syntax of an OpenMP construct. What is a “combined construct”? Give an example.
- Describe the `parallel` directive and its main clauses (`if`, `num_threads`, variable property clauses). What happens when the `if` condition is false?
- Explain each variable property: `shared`, `private`, `firstprivate`, `default`, `reduction`. For each, state what the initial value is inside the region and what happens to the variable after the region ends.
- What is the caveat with pointer properties in OpenMP? If a pointer is declared `shared`, what does that mean for the data it points to?
- Describe the `reduction` clause. What operators are permitted? Why may the result differ slightly for floating-point reductions? Why does OpenMP use a linear implementation rather than a logarithmic one? Give both complexity formulas.
- Explain the `threadprivate` directive. How does it differ from `private`? What is `copyin` used for? What constraint exists on the number of threads across regions?
- Describe the priority ordering for determining the number of threads p in a parallel region. What role do ICVs play? What happens when `dyn-var` is true vs. false?
- What happens if any thread in a parallel region is prematurely terminated?
- What does “relaxed consistency memory model” mean in OpenMP? When must the programmer intervene to ensure memory visibility?