First 5 minutes of hell

Functional (task) parallelism utilizes tasks (= units of parallel computation) to enable effective parallelization of recursive algorithms. When a thread (producer) encounters the #pragma omp task directive, it generates a new child task and stores it into the task pool. The task will wait there until picked up by another free thread (consumer) which will execute it.

The #pragma omp task must be inside a parallel region (so there are multiple threads to pick up the tasks), otherwise all tasks would be done by one thread only (so sequentially with additional overhead)

#pragma omp taskwait

  • the parent task must wait for all child tasks to complete before continuing

#pragma omp single

  • this directive ensures that the code will be executed by just one thread (out of threads ready - OS decides, which thread) to avoid running multiple and identical recursion trees in parallel
    • other threads skip the code block, but they must wait at the end (implicit barrier)
    • and there would also be increased overhead due to access conflicts to the shared memory (and more cache misses)

Important clauses:

  • if condition - whether the producer thread should create a task or run the task code directly (and then continue with it’s work)
  • final(expression) - if the expression is true, then this task is final and no further explicit tasks will be created
  • priority(expression) - assigns a priority to the child task
  • ideal for parallelizing recursive algorithms (Divide and conquer)

The task directive - syntax and semantics

#pragma omp task [clause[ [,] clause] ...]
{ task structured block }
  • a task is a unit of parallel computation consisting of:
    • Code to be executed (the structured block)
    • Input data (captured at task creation time)
    • A data structure for storing the identity of the consumer thread once it starts executing the task
  • the producer-consumer mechanism
    1. A thread executing a parent task encounters #pragma omp task. This producer thread generates a new child task and stores it into the task pool.
    2. The child task waits in the task pool until some free thread withdraws it and becomes the consumer thread that executes it.
    3. The consumer thread may itself encounter task directives during execution, becoming a producer in turn. This cycle continues until all tasks are completed.
  • the task clause must be in the parallel section, otherwise the whole task mechanism would be performed on a single thread (so sequentially and with additonal overhead)
    • a normal sequential solution would be better here

Clauses of the task directive

  • we have to balance the granularity of the tasks using the if (condition) directive

    • what should be pushed into the child task pool (with the overhead) and what should be executed “on the spot” sequentially by the parent thread?
    • balancing parallelism overhead vs. useful work
    • e.g. when parallelizing the recursive binary tree
      • it makes sense to parallelize the top big subtrees, but the lower short subtrees are better handled sequentially (lower overhead, faster)
  • variable properties

    • variable values are captured at the task creation (default is firstprivate(list))
  • taskwait directive

    • a synchronization point where the parent thread must wait for all of it’s direct child tasks to complete
    • it is a specific kind of barrier, but just for the child tasks generated by the current task, not for all tasks in the system
    • this directive is essential for the correctness in recursive algorithms
      • e.g. parent needs to wait on the results of the recursive calls to sum them up
        • here the sequential approach waits naturally, but the parallel could go on without the child completing and therefore, having uninitialized values
  • single directive is used for the first recursive call

    • the first recursive call must be called by just one thread, otherwise it would be called by all available threads, performing identical recursion trees being executed redundantly
    • the rest of the threads act only as consumers at the start and pick tasks from the task pool once they are added
void main() {
    int k;
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        k = subtreesize2(p, 0);
    }
}

Summary of the task execution model

parallel region created (p threads)
    │
    ├── single: one thread starts the recursion (producer)
    │       │
    │       ├── task: child placed in task pool ──→ idle thread picks it up (consumer)
    │       │       │
    │       │       ├── task: grandchild placed in pool ──→ ...
    │       │       └── taskwait: wait for grandchildren
    │       │
    │       ├── task: child placed in task pool ──→ idle thread picks it up
    │       │       └── ...
    │       │
    │       └── taskwait: wait for both children
    │
    └── implicit barrier: all threads synchronize

The key insight is that the team of threads acts as a shared task-execution workforce. One thread seeds the task pool; all threads collaboratively consume and produce tasks until the recursion tree is fully explored.


Potential exam questions

  1. Describe the task directive and the producer-consumer mechanism. What are the three components of a task? How does a task get from creation to execution?
  2. What does the if(condition) clause do on a task directive? Describe both cases (condition true and condition false). Why is this clause critical for performance?
  3. Write pseudocode for a naive parallel binary tree computation using tasks. Explain why it is inefficient. How can it be improved?
  4. What is taskwait? How does it differ from a general barrier? What happens if taskwait is omitted in a recursive task-parallel computation?
  5. Why must the task directive appear inside a parallel region? What happens if it does not?
  6. Why must the initial recursive call be wrapped in #pragma omp single? What would happen without it?
  7. Describe the complete mandatory setup pattern for task parallelism: parallel + single + task + taskwait. Write a minimal working example.
  8. How does cancel taskgroup work for early termination of task-parallel search? Why is atomic write needed when writing the result?
  9. Compare data parallelism (for directive) with functional parallelism (task directive). When is each appropriate?
  10. In Code 11, the if(h < THRESHOLD) clause is used. Explain the tradeoff: what happens if THRESHOLD is too small? Too large?