First 5 minutes of hell
Functional (task) parallelism utilizes tasks (= units of parallel computation) to enable effective parallelization of recursive algorithms. When a thread (producer) encounters the
#pragma omp taskdirective, it generates a new child task and stores it into the task pool. The task will wait there until picked up by another free thread (consumer) which will execute it.The
#pragma omp taskmust be inside a parallel region (so there are multiple threads to pick up the tasks), otherwise all tasks would be done by one thread only (so sequentially with additional overhead)
#pragma omp taskwait
- the parent task must wait for all child tasks to complete before continuing
#pragma omp single
- this directive ensures that the code will be executed by just one thread (out of threads ready - OS decides, which thread) to avoid running multiple and identical recursion trees in parallel
- other threads skip the code block, but they must wait at the end (implicit barrier)
- and there would also be increased overhead due to access conflicts to the shared memory (and more cache misses)
Important clauses:
- if condition - whether the producer thread should create a task or run the task code directly (and then continue with it’s work)
- final(expression) - if the expression is true, then this task is final and no further explicit tasks will be created
- priority(expression) - assigns a priority to the child task
- ideal for parallelizing recursive algorithms (Divide and conquer)
The task directive - syntax and semantics
#pragma omp task [clause[ [,] clause] ...]
{ task structured block }- a task is a unit of parallel computation consisting of:
- Code to be executed (the structured block)
- Input data (captured at task creation time)
- A data structure for storing the identity of the consumer thread once it starts executing the task
- the producer-consumer mechanism
- A thread executing a parent task encounters
#pragma omp task. This producer thread generates a new child task and stores it into the task pool. - The child task waits in the task pool until some free thread withdraws it and becomes the consumer thread that executes it.
- The consumer thread may itself encounter
taskdirectives during execution, becoming a producer in turn. This cycle continues until all tasks are completed.
- A thread executing a parent task encounters
- the
taskclause must be in theparallelsection, otherwise the whole task mechanism would be performed on a single thread (so sequentially and with additonal overhead)- a normal sequential solution would be better here
Clauses of the task directive
-
we have to balance the granularity of the tasks using the
if (condition)directive- what should be pushed into the child task pool (with the overhead) and what should be executed “on the spot” sequentially by the parent thread?
- balancing parallelism overhead vs. useful work
- e.g. when parallelizing the recursive binary tree
- it makes sense to parallelize the top big subtrees, but the lower short subtrees are better handled sequentially (lower overhead, faster)
-
variable properties
- variable values are captured at the task creation (default is
firstprivate(list))
- variable values are captured at the task creation (default is
-
taskwaitdirective- a synchronization point where the parent thread must wait for all of it’s direct child tasks to complete
- it is a specific kind of barrier, but just for the child tasks generated by the current task, not for all tasks in the system
- this directive is essential for the correctness in recursive algorithms
- e.g. parent needs to wait on the results of the recursive calls to sum them up
- here the sequential approach waits naturally, but the parallel could go on without the child completing and therefore, having uninitialized values
- e.g. parent needs to wait on the results of the recursive calls to sum them up
-
singledirective is used for the first recursive call- the first recursive call must be called by just one thread, otherwise it would be called by all available threads, performing identical recursion trees being executed redundantly
- the rest of the threads act only as consumers at the start and pick tasks from the task pool once they are added
void main() {
int k;
#pragma omp parallel num_threads(4)
{
#pragma omp single
k = subtreesize2(p, 0);
}
}Summary of the task execution model
parallel region created (p threads)
│
├── single: one thread starts the recursion (producer)
│ │
│ ├── task: child placed in task pool ──→ idle thread picks it up (consumer)
│ │ │
│ │ ├── task: grandchild placed in pool ──→ ...
│ │ └── taskwait: wait for grandchildren
│ │
│ ├── task: child placed in task pool ──→ idle thread picks it up
│ │ └── ...
│ │
│ └── taskwait: wait for both children
│
└── implicit barrier: all threads synchronize
The key insight is that the team of threads acts as a shared task-execution workforce. One thread seeds the task pool; all threads collaboratively consume and produce tasks until the recursion tree is fully explored.
Potential exam questions
- Describe the
taskdirective and the producer-consumer mechanism. What are the three components of a task? How does a task get from creation to execution? - What does the
if(condition)clause do on ataskdirective? Describe both cases (condition true and condition false). Why is this clause critical for performance? - Write pseudocode for a naive parallel binary tree computation using tasks. Explain why it is inefficient. How can it be improved?
- What is
taskwait? How does it differ from a generalbarrier? What happens iftaskwaitis omitted in a recursive task-parallel computation? - Why must the
taskdirective appear inside aparallelregion? What happens if it does not? - Why must the initial recursive call be wrapped in
#pragma omp single? What would happen without it? - Describe the complete mandatory setup pattern for task parallelism:
parallel+single+task+taskwait. Write a minimal working example. - How does
cancel taskgroupwork for early termination of task-parallel search? Why isatomic writeneeded when writing the result? - Compare data parallelism (
fordirective) with functional parallelism (taskdirective). When is each appropriate? - In Code 11, the
if(h < THRESHOLD)clause is used. Explain the tradeoff: what happens ifTHRESHOLDis too small? Too large?