Reducing Overhead
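The coarser the grain, the better. Why? Our architectures trade bandwidth for latency: each transfer or synchronization pays a fixed startup cost, so a few large operations are cheaper than many small ones.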
- The compiler must aggregate data for transfer.
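A rough cost model makes the trade-off concrete. This is the usual first-order latency/bandwidth model, assumed here rather than taken from the slides. The symbols are illustrative: \alpha is the per-transfer startup latency, \beta the bandwidth, n the total bytes, and k the number of pieces the data would otherwise be sent in.

```latex
T(n) = \alpha + \frac{n}{\beta}, \qquad
T_{\text{split}} = k\left(\alpha + \frac{n}{k\beta}\right) = k\alpha + \frac{n}{\beta}, \qquad
T_{\text{split}} - T(n) = (k-1)\,\alpha
```

Under this model, aggregating the k pieces into one transfer saves (k-1) startup latencies, whatever the bandwidth.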
Combine multiple DO directives
- More work per parallel region, less synchronization.
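Replicated execution is OK: letting every thread redo a cheap computation is usually cheaper than synchronizing to share one result.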
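A minimal sketch of both points, assuming the slides refer to OpenMP. It is written in C, where `#pragma omp for` is the counterpart of the Fortran DO directive. The function name `scale_then_shift` and its arguments are hypothetical, chosen only for illustration. Both loops share one parallel region, so the thread team is started once, and the scalar `offset` is computed redundantly by every thread instead of being computed once and shared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: b[i] = factor * a[i], then c[i] = b[n-1-i] + offset. */
void scale_then_shift(const double *a, double *b, double *c,
                      size_t n, double factor)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* One parallel region for both loops: the team is created once. */
    #pragma omp parallel
    {
        /* Replicated execution: each thread computes its own private copy.
           This is cheap and needs no synchronization. */
        double offset = factor * 0.5;

        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            b[i] = factor * a[i];
        /* The implicit barrier at the end of the loop above is required:
           the next loop reads elements of b written by other threads. */

        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            c[i] = b[(long)n - 1 - i] + offset;
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result. With GCC, OpenMP is enabled with `-fopenmp`.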