First 5 minutes of hell
Sequential time is , firstly final histogram array of length
rangeis initialized to null values and then we go throughnvalues and increasing respective positions in the final histogram.There are two different ways, how to handle parallelization of histogram computation (codes below):
- having one shared histogram and using
atomic updateto access it
- less memory load, but atomic adds overhead and it sequentializes the algorithm a bit
- can cause false sharing (depends on data distribution, if many threads will write into close locations in the final histogram)
- having multiple local histograms, which are accumulated separately and then reduced in the end
- more memory usage as there are multiple local copies of the histogram
The local computation and then global reduction is an often used trick.
Problem definition
- histogram expresses frequencies of appearances of data values
- we expect values in the input array that are in some range (e.g. 0-256 for 8-bit grayscale images)
- values from the range serve as indexes
- the final histogram assigns frequencies of given value from the range
- standard sequential algorithm has complexity
O(n) + O(range)
Sequential algorithm (Code HG-Seq)
int *A, n, // input array and its size
histogram[range]; // resulting histogram
for (int i = 0; i < range; i++) // histogram initialization
histogram[i] = 0;
for (int j = 0; j < n; j++) // histogram computation
histogram[A[j]]++;- increment operation
histogram[A[j]]++involves three non-atomic sub-operations: read the current value, add one, write back- this needs to be an atomic operation in parallel algorithms to avoid race conditions
Variant 1: Parallel computation of a shared histogram (HG-Shrd)
int *A, n;
int histogram[range];
for (int i = 0; i < range; i++) // histogram initialization
histogram[i] = 0; // master thread
#pragma omp parallel for schedule(static) shared(histogram)
{ for (int j = 0; j < n; j++) { // histogram computation
#pragma omp atomic update
histogram[A[j]]++;
} }- parallel time:
T(n, range, p) = O(range) + O(n/p) - no additional memory and no additional computation is needed (beyond the sequential algorithm)
- the atomic update is essential, otherwise the race conditions would introduce false results
- performance issues:
- Serialization on atomic operations: When multiple threads happen to update the same histogram index (because they encounter the same value in their input portions), they are serialized by the atomic operation - each must wait for the previous one to complete.
- False sharing is unavoidable: Since input data values serve as indexes (indirect addressing), two different threads can write to adjacent histogram positions at any time. Whether false sharing occurs depends entirely on the data values, not on the scheduling strategy. No clever choice of
schedulecan eliminate it- this is data-dependent, we could build the histogram without any false sharing and we could trigger false sharing with each access
Variant 2: Parallel reduction of local partial histograms (HG-Local)
The idea is a time-space tradeoff: each thread gets its own private histogram copy in shared memory, computes locally without any conflicts, and then all threads cooperate to reduce the local histograms into a single global result.
int *A, n,
histogram[p][range]; // p = # of threads
#pragma omp parallel
{ int my_id = omp_get_thread_num();
for (int i = 0; i < range; i++) // each thread initializes
histogram[my_id][i] = 0; // sequentially its own histogram
#pragma omp for schedule(static)
for (int j = 0; j < n; j++) // histogram computation
histogram[my_id][A[j]]++;
#pragma omp for schedule(static)
for (int i = 0; i < range; i++) // each thread reduces
for (int j = 1; j < p; j++) // range/p p-tuples of local
histogram[0][i] += histogram[j][i]; // counters
}- phases:
- Initialization (
O(range)per thread, all in parallel): Each thread zeroes out its own row of the 2D histogram array. - Parallel computation (
O(n/p)): Threads write into their own local histograms only - no side effects to other threads, no atomic operations needed, no false sharing during this phase. - Parallel reduction (
O(range)): Therangecolumns of the 2D histogram are split amongpthreads. Each thread sumsrange/pcolumns across allprows intohistogram[0].- there is no race condition, since no thread writes to the same location there (only the outer loop is parallelized, the inner loop is sequential and writes only to [0][i], which is split by the parallel for)
- Initialization (
Parallel time
- The initialization and reduction phases each cost
O(range)parallel time. The main computation phase costsO(n/p).
Comparison of variants
| Aspect | HG-Shrd (shared) | HG-Local (local + reduction) |
|---|---|---|
| Additional memory | None | p × range (p-fold) |
| Atomic operations | Required (#pragma omp atomic update) | Not needed |
| False sharing | Unavoidable (data-dependent indirect addressing) | None during main computation |
| Parallel time | O(range) + O(n/p) | O(range) + O(n/p) + O(range) |
| Data dependency | Performance depends strongly on input data distribution | Performance independent of data distribution |
| Scheduling | static preferable | static preferable |
- if is large or are large, the second variant would require large space allocations (for their own copies of the histogram)
Key design principles illustrated by histogram parallelization
The histogram example demonstrates several general principles:
- Local computation + global reduction is a very common trick: do the heavy computation purely locally to produce partial results, then reduce (globalize) them at the end. The lecturer likened it to each thread doing its “homework” privately, then “publishing” the results.
- Indirect addressing defeats false sharing elimination: When write targets are determined by input data values rather than by the loop index, no scheduling or alignment strategy can prevent false sharing. The only recourse is to eliminate shared writes entirely (as HG-Local does).
- Time-space tradeoff: HG-Local trades
p-fold memory for elimination of atomics and false sharing. This tradeoff is worthwhile whenrangeandpare moderate. - Static scheduling is preferable in both variants because the per-iteration work is uniform (one increment per iteration in the main loop).
Potential exam questions
- Describe the sequential histogram computation algorithm and state its time complexity.
- Write OpenMP code for the shared-histogram parallel variant (HG-Shrd). Why is
#pragma omp atomic updatenecessary? What happens if it is omitted? - Explain why false sharing in the shared-histogram variant is unavoidable. Why can’t alignment or scheduling strategy eliminate it?
- Write OpenMP code for the local-histogram parallel variant (HG-Local). Describe the three phases and their complexities.
- Derive the parallel time complexity of HG-Local:
T(n, range, p) = O(range) + O(n/p) + O(range). Explain what each term corresponds to. - How does the parallel reduction phase work in HG-Local? For
p = 4andrange = 12, describe concretely what each thread computes during reduction. - Why is no atomic operation needed in HG-Local during either the computation or the reduction phase?
- Under what conditions is HG-Local infeasible? Give a concrete example with numerical values.
- Compare HG-Shrd and HG-Local in terms of memory overhead, synchronization overhead, false sharing, and data dependence of performance.
- Why is histogram computation classified as a memory-bound algorithm? What limits its parallel scalability?
- The lecturer described HG-Local as a “local computation + global reduction” pattern. Give another example (from the course material) where this pattern could be applied.
- If the histogram
rangeis 256,p = 8, andsizeof(int) = 4, how much additional memory does HG-Local require compared to HG-Shrd? Is this acceptable?