First 5 minutes in hell
Sequential “Lomuto’s Quicksort” works as follows:
- select a pivot = the last element of the unsorted array
- have two indexes and , which both traverse the array from left to right so that this invariant applies:
- all elements from first element to -th position are less than pivot
- all elements from to -th position are greater or equal to the pivot
- the last operation is swapping the pivot with -th position (so all elements on the right are less than pivot and all elements on the left are bigger than pivot)
- but after one round not all elements are sorted, we need to recurse to 1) left part of the array and 2) right part of the array (the pivot “stays” in place)
Straightforward parallelization (PUV):
- apply task parallelism on both recursions and begin with the
singledirective- problems:
- inefficient use of the task pool (even small OpenMP tasks are put there)
- the swapping procedure is sequential (so other threads could be idle), because the and depend on each other
- a thread creates 2 tasks and does not have own work to do (has to go to task pool and pick up the task it just created)
- solutions:
- Tail call optimization (TCO) - right side is processed iteratively by the current thread and the left side is packaged to a task and handed off to the task pool (so the parent thread still has work to do)
- Task parallelism thresholding (ST) - we will introduce a threshold of the form
n/(k*p)and if the number of array elements is under this threshold, solve it sequentially
- : exactly same number of tasks as threads, no effective load balancing could be done
- : allows for better load balancing (more tasks than threads in the pool), “not going too far with small, ineffective, tasks”
Hoare’s Quicksort
- because Lomuto’s Quicksort cannot be well parallelized ( and move sequentially and depend on each other)
- the process:
- goes left-to-right and goes right-to-left and again, the last element is the pivot
- there are 4 cases what could happen while comparing values on both pointers
- the process is called neutralization (= at least one element ends up in the right place)
- at the end, the pivot is swapped into the meeting point of and
- how it could be parallelized?
- each thread has a unique
my_iandmy_j, it neutralizes this pair and then takes another freemy_iandmy_j, which are not taken up by other threads (going closer to the center of the array)
- therefore the values must be shared (visible by other threads) and be updated by
atomic capturedirective- more effective way is not to take individual element pairs (a lot of atomic operations and false sharing), but capturing a neutralizing whole disjoint blocks of pairs
- the same rules apply:
- after seq. neutralization at most one dirty block (having displaced elements) and at least one clean block (having all elements in the correct part) is left
- the thread keeps the dirty block and gets another block (because the clean block is finished)
- the block size be a multiple of the cache block size
- at the end there are at most dirty blocks - they are neutralized sequentially by master thread (this is relatively cheap for small )
- pivot is not swapped as in the previous version, it is determined by the common pivot determination techniques (e.g. randomly sample 5 and take median)
- quicksort is data-dependable and the complexity depends on the quality of the pivot with respect to the data
- warning, the Hoarse Quicksort helps me to parallelize the sequential partition, but the recursive part of the algorithm is still there (so it is a nested parallelism (task + multithreaded par_partition)), so we need to set the maximum active levels to keep the number of threads optimal in relation to number of physical cores (best scenario is to have the same amount of threads as cores, I can go up to twice as much threads as core, but not more)
QuickSort properties relevant to parallelization
QuickSort is a recursive Divide-and-Conquer algorithm that is data sensitive (performance depends on input distribution), in-place (requires only O(log₂ n) auxiliary memory for the recursion stack in efficient implementations), and can be implemented exclusively with Compare-and-Swap operations. It is unstable - the ordering of equal elements is not preserved.
The key structural property for parallelization is that QuickSort performs its main work (partitioning) on the way down the recursion tree, before the recursive calls. This means the largest tasks appear at the top of the tree of recursive calls (TRC), which is favorable for parallel task creation - unlike MergeSort, which does work on the way back up.
All performance measurements in the lecture use arrays of 100 million random integers on a 12-core Intel Xeon E5-2680v3 (Haswell) with GNU GCC 5.4. Random data represents a favorable case for QuickSort.
Sequential baseline: Lomuto’s partitioning
The textbook sequential QuickSort uses Lomuto’s partitioning variant. Two indices i and j both traverse the array left-to-right. The pivot is always the last element A[hi]. The invariant maintained is:
After each iteration of j, all elements in A[lo, ..., i-1] are less than the pivot, and all elements in A[i, ..., j] are greater than or equal to the pivot.
long seq_partition_L(int* A, long lo, long hi) {
long pivot = A[hi];
long i = lo;
for (long j = lo; j < hi; j++)
if (A[j] < pivot)
swap(A, i++, j);
swap(A, i, hi); // pivot placement
return i; // return pivot index
}When the loop completes, a final swap places the pivot at position i, which becomes the splitter index. The sequential textbook version (SUV) achieves 11.46 s, close to the highly optimized GNU sequential version (SGNU) at 10.33 s.
Straightforward parallelization (PUV)
The standard task-parallel pattern for recursive algorithms is applied. A parallel region is created with a single thread initiating the recursion. Each recursive call is wrapped in #pragma omp task:
void par_quicksort(int* A, long lo, long hi) {
#pragma omp parallel
{
#pragma omp single
par_quicksort_rec(A, lo, hi);
}
}
void par_quicksort_rec(int* A, long lo, long hi) {
if (lo < hi) {
long r = seq_partition_L(A, lo, hi);
#pragma omp task
par_quicksort_rec(A, lo, r - 1);
#pragma omp task
par_quicksort_rec(A, r + 1, hi);
}
}PUV achieves only speedup 4.86 on 12 cores (2.36 s), far from PGNU’s speedup of 11.03 (0.94 s). Two main problems explain this:
- Creating a large number of small OpenMP tasks causes excessive task pool management overhead. When a thread generates two tasks and puts both into the pool, it has nothing to do itself - it must go back to the pool and potentially pick up a task it just created.
- The
seq_partition_Lfunction is sequential, so partitioning of the entire input array is performed by one thread while others remain idle at the initial levels of the TRC. This is Amdahl’s law in action.
Optimization 1: Tail call optimization (TCO)
One recursive call is replaced with an iteration loop. The second recursive call par_quicksort_rec(A, r+1, hi) becomes lo = r + 1 inside a while loop:
void par_quicksort_rec(int* A, long lo, long hi) {
while (lo < hi) {
long r = seq_partition_L(A, lo, hi);
#pragma omp task
par_quicksort_rec(A, lo, r - 1); // left part -> task pool
lo = r + 1; // right part -> continue
}
}Only one task is put into the pool (the left part) while the current thread keeps processing the right part directly. This approximately halves the number of OpenMP tasks created. If the pivot is close to the median, the number of OpenMP tasks drops from O(n) to O(log n).
A further optimization (not elaborated in the lecture) would be to put the larger part into the task pool and keep the smaller part, since uneven splits may otherwise waste work.
Optimization 2: Task parallelism thresholding (ST)
When the subarray size drops below a threshold n/(k*p), the algorithm switches to sequential QuickSort instead of creating new tasks. The parameter k controls over-decomposition:
k = 1produces exactlyptasks forpthreads, which does not allow load balancingk = 5ork = 10producesk*ptasks total, enabling automatic load balancing through the OpenMP runtime
The optimal k is system-dependent and must be determined experimentally.
void par_quicksort_rec2(int* A, long lo, long hi, const long seq_thr) {
while (lo < hi) {
if ((hi - lo) < seq_thr) {
seq_quicksort_L(A, hi, lo); // sequential fallback
return;
}
long r = seq_partition_L(A, lo, hi);
#pragma omp task
par_quicksort_rec2(A, lo, r - 1, seq_thr);
lo = r + 1;
}
}The threshold is computed at the beginning as seq_thr = (hi - lo + 1) / omp_get_num_threads() / k.
Optimization 3: Parallel partitioning (PP)
Why Lomuto’s variant cannot be parallelized
In Lomuto’s partitioning, each iteration depends on the previous one - both i and j advance sequentially left-to-right. There is no way to give meaningful independent work to p threads.
Hoare’s partitioning - the parallelizable alternative
In Hoare’s variant, index i traverses left-to-right and j traverses right-to-left. The process is called neutralization. Four cases arise when comparing A[i] with pivot and A[j] with pivot:
A[i] < pivotandA[j] >= pivot- both correctly placed, advance both pointers inwardA[i] < pivotandA[j] < pivot- left element correct, advanceirightwardA[i] >= pivotandA[j] >= pivot- right element correct, advancejleftwardA[i] >= pivotandA[j] < pivot- swap both, advance both pointers inward
In each iteration, at least one array element is neutralized. The key insight is that neutralization operations on different pairs of elements are independent - the only requirement is that no two threads work on the same array element.
long seq_partition_H2(int* A, long lo, long hi) {
long pivot = A[hi];
long i = lo;
long j = hi - 1;
while (i < j) {
if (A[i] >= pivot && A[j] < pivot) {
swap(A, i, j); i++; j--;
} else {
if (A[j] >= pivot) j--;
if (A[i] < pivot) i++;
}
}
if (A[j] < pivot) j++;
swap(A, j, hi);
return j;
}Parallel partitioning with atomic capture
The indices i and j become shared variables. Each thread must atomically capture a unique value of i or j before working on that element. The naive attempt without synchronization is incorrect because increment and decrement are three operations (read, update, write back) and are not atomic - multiple threads can capture the same value.
The correct solution uses #pragma omp atomic capture (Fetch-and-Add):
#pragma omp atomic capture
{ my_i = i; i++; } // abbreviated as [my_i = i++]
// the thread takes up the current i and increases it by one atomically (so other threads can take up other i)
#pragma omp atomic capture
{ my_j = j; j--; } // abbreviated as [my_j = j--]Each thread captures unique private copies my_i and my_j, performs neutralization on A[my_i] and A[my_j], then captures new elements as needed. At the end, each thread is left with at most one dirty (incorrectly placed) element. Since p << n, sequential cleaning of at most p dirty elements is fast.

Why element-level partitioning is inefficient
Partitioning by individual array elements causes frequent atomic accesses into shared memory and thrashing of cache blocks by different threads (false sharing). With 100 million numbers and 12 threads, each partitioning step involves roughly 50 million neutralizations - doing an atomic capture for each is prohibitively expensive.
Block-level parallel partitioning
The solution is to capture and process disjoint blocks of K elements at a time:
#pragma omp atomic capture
{ my_i = i; i += K; } // capture a block of K elementsKey definitions and properties:
- Dirty block: contains elements both less than and greater than or equal to the pivot
- Clean block: all elements are correctly placed relative to the pivot
- After sequential neutralization of a left block
B_L(k)and a right blockB_R(k), at least one block becomes clean and correctly placed - Each thread ends with at most 1 dirty block
- Ideally,
K * sizeof(int)should be a multiple of the cache block size
The thread traverses its left block with pointer i_k left-to-right and its right block with pointer j_k right-to-left, performing the same neutralization logic as in the single-element case but within block boundaries. When one block becomes clean, the thread captures a new partner block and continues.
par_partition_block pseudocode:
1. Each thread atomically captures a left and right block of size K
2. Loop: neutralize the block pair sequentially
- If left block is clean: capture new left block
- If right block is clean: capture new right block
3. Barrier
4. Sequential neutralization of remaining <= p dirty blocks
5. Return boundary index
After the main parallel loop and a barrier, the master thread sequentially cleans the remaining dirty blocks (at most p of them).
Nested parallelism for parallel partitioning
Parallel partitioning requires nested OpenMP parallelism: the thread executing a task from the task pool must create its own team for multi-threaded partitioning. This is activated via omp_set_max_active_levels(depth) where depth > 1.
At the root of the TRC, all p threads can participate in the initial partitioning since the task pool is empty. Deeper in the tree, fewer threads are available per partition since work is distributed across multiple concurrent nodes. The maximum nesting depth must be controlled to avoid creating more threads than cores.
The number of real threads with respect to the number of real cores should be more or less the same. Maybe you could have twice as many threads as cores, but not more.
Pivot selection
In block-level neutralization, there is no swap of the pivot as the last element with the boundary element. Only the pivot value is needed during partitioning. The master thread returns the boundary index after cleaning dirty blocks, and the pivot ends up somewhere in the right part.
This decoupling allows using any pivot selection method independently of the partitioning mechanism:
- Median of 3 or 5 randomly selected elements
- Median of a larger number of equidistant element samples
Good pivot selection is particularly important at the top levels of the TRC, where poor pivots cause severely unbalanced splits.
Load balancing for nested parallel regions
QuickSort’s data sensitivity complicates load balancing. The ratio of the left and right part sizes after partitioning depends on pivot quality. The idea is to pass the left/right size ratio as an argument to the recursive function, then use the num_threads clause of #pragma omp parallel to allocate threads proportionally. For example, a 2:5 split should assign threads in roughly a 2:5 ratio. Each part must receive at least 1 thread.
Final performance evaluation
Sorting 100 million random integers on 12-core Intel Xeon E5-2680v3:
| Variant | Runtime | Speedup (vs SUV) |
|---|---|---|
| SGNU | 10.33 s | - |
| SUV | 11.46 s | 1.00 |
| PGNU | 0.94 s | 11.03 |
| PUV | 2.36 s | 4.86 |
| TCO/ST | ~1.7 s | ~6.7 |
| TCO/ST/PP | ~1.0 s | ~11 |
TCO/ST/PP (all three optimizations combined) brings performance to within a few percent of PGNU. The remaining gap is due to additional optimizations in GNU’s libstdc++: cache memory effects, switching to InsertionSort for small inputs, better pivot selection, and other refinements.
This near-parity with GNU is specific to random inputs. For non-random data distributions (sorted, reverse-sorted, many duplicates), GNU’s implementation would significantly outperform the textbook version due to its many additional optimizations.
Summary of the three optimizations
- TCO (Tail Call Optimization): replace one recursive call with iteration, halving task creation. Reduces OpenMP tasks from
O(n)toO(log n)for median pivots. - ST (Sequential Thresholding): switch to sequential sort below threshold
n/(k*p), avoiding overhead of tiny tasks while enabling load balancing through over-decomposition. - PP (Parallel Partitioning): use Hoare’s neutralization with block-level atomic capture and nested parallelism to parallelize the partitioning step, addressing the Amdahl’s law bottleneck.
Potential exam questions
- Why is Lomuto’s partitioning unsuitable for parallelization, and what property of Hoare’s partitioning makes it parallelizable?
- Explain the concept of neutralization in Hoare’s partitioning. What are the four cases, and why is at least one element neutralized per iteration?
- Why is the first attempt at parallel partitioning (with shared
i,jwithout synchronization) incorrect? What OpenMP directive fixes it? - What is the purpose of
#pragma omp atomic capturein parallel partitioning? How does it differ from#pragma omp critical? - Explain why element-level parallel partitioning is inefficient and how block-level partitioning solves this problem. What is the dirty/clean block invariant?
- Why does parallel partitioning in QuickSort require nested OpenMP parallelism? How is nesting depth controlled and why must it be limited?
- Describe the three optimizations (TCO, ST, PP) applied to the naive parallel QuickSort. What specific problem does each one address?
- How does tail call optimization reduce the number of OpenMP tasks from
O(n)toO(log n)? What assumption about the pivot is required? - What is the role of the parameter
kin the thresholding formulan/(k*p)? What happens whenk = 1versusk = 10? - How is pivot selection decoupled from block-level partitioning? Why is good pivot selection particularly important at the top levels of the TRC?
- Explain the load balancing problem for nested parallel regions in parallel QuickSort. How can threads be allocated proportionally to subarray sizes?
- Compare the parallel performance of PUV, TCO/ST, and TCO/ST/PP. What is the main bottleneck at each stage and how is it resolved?