Table of Contents
Performance Optimization for the Origin 2000
Outline
Performance
Performance Examples
What is Optimization?
Types of Optimization
Steps of Optimization
Performance Strategies
The 80/20 Rule
How high is up?
Don't Sweat the Small Stuff
Considerations when Optimizing
Origin 2000 Architecture
Origin 2000 Architecture
R10000 Architecture
R10000 Architecture
Cache Architecture
Caches exploit Locality
Cache Benchmark
Cache Performance
Cache Mapping
2-Way Set-Associative Cache
What is a TLB?
O2K Memory Hierarchy
Origin 2000 Access Times
Performance Metrics
What about MFLOPS?
What do we use for evaluation?
Performance Metrics
Fallacies
Basis for Performance Analysis
Asymptotic Analysis
Amdahl's Law
Amdahl's Law
Amdahl's Law
Efficiency
Issues in Performance
Issues in Performance
Problem Size and Precision
Parallel Performance Issues
Understanding Compilers
Compiler Technology
Recommended Flags
Accuracy Considerations
Compiler flags
Roundoff example
Exceptions
Exception profiling
Aliasing
Advanced Aliasing
Advanced Aliasing
Software Pipelining
Interprocedural Analysis
IPA features
Inlining
Manual Inlining
Loop Nest Optimizer
LNO functionality
Optimized Arithmetic Libraries
Numerical Libraries
CHALLENGEcomplib and SCSL
LAPACK
ScaLAPACK
PETSc
O2K Performance Tools
External Timers
External Timers
Internal Timers
Internal Timers
Hardware Performance Counters
Some Hardware Counter Events
Hardware Performance Counter Access
Origin Counter API
Perfex usage
Perfex features
SpeedShop
SpeedShop Components
SpeedShop Usage
SpeedShop Sampling
SpeedShop Sampling
SpeedShop Counting
Ideal Experiment
Prof Usage
ideal Experiment Example
pcsamp Experiment Example
usertime Experiment Example
Gprof Usage
Gprof information
Exception Profiling
Address Space Profiling
dprof
Parallel Profiling
Parallel Profiling
CASEVision Debugger
Outline
Guidelines for Performance
Array Optimization
Memory Access
Array Allocation
Array Referencing
Array Initialization
Array Initialization
Array Padding
Array Padding: a = a + b * c
Stride Minimization
Stride Minimization
Stride Minimization
Loop Fusion
Loop Fusion
Loop Fusion
Loop Interchange
Loop Interchange
Loop Interchange
Floating IF's
Floating IF's
Floating IF's
Loop Defactorization
Gather-Scatter Optimization
Gather-Scatter Optimization
IF Statements in Loops
Loop Defactorization
Loop Defactorization
Loop Defactorization
Loop Peeling
Loop Peeling
Loop Peeling
Loop Collapse
Loop Collapse
Loop Collapse
Loop Collapse
Loop Unrolling
Loop Unrolling
Loop Unrolling
Loop Unrolling
Loop Unrolling and Sum Reductions
Loop Unrolling and Sum Reductions
Loop Unrolling and Sum Reductions
Loop Unrolling and Sum Reductions
Outer Loop Unrolling
Outer Loop Unrolling
Outer Loop Unrolling
Outer Loop Unrolling
Cache Blocking
Cache Blocking
Loop structure
Strength Reduction
Strength Reduction Horner's Rule
Strength Reduction Horner's Rule
Strength Reduction Integer Division by a Power of 2
Strength Reduction Integer Division by a Power of 2
Strength Reduction Integer Division by a Power of 2
Strength Reduction Factorization
Strength Reduction Factorization
Strength Reduction Factorization
Subexpression Elimination Parenthesis
Subexpression Elimination Parenthesis
Subexpression Elimination Type Considerations
Subexpression Elimination Type Considerations
F90 Considerations
C/C++ Considerations
dplace Usage
Parallel Optimization
Choosing a Data Distribution
Possible Data Layouts
Two-dimensional Block-Cyclic Distribution
Load Balancing
MPP Optimization
Parallel Performance
Message Passing APIs
Message Passing APIs
Message Passing Interface
Message Passing
Message Passing
Communication Issues
Communication Issues
Communication Issues
Message Passing
Message Passing
MPI Message Passing
MPI Message Passing
MPI Message Passing
MPI Optimizations
MPI Data Types
MPI Collective Communication
MPI Collective Communication
Message Passing Optimizations
Message Passing Optimization Nearest Neighbor Example 1
Message Passing Optimization Nearest Neighbor Example 2
Message Passing Optimization Nearest Neighbor Example
MPI Message Passing
Automatic Parallelization
Automatic Parallelization
Data Parallelism
Data Parallelism on the SGI's
Data Parallelism on the SGI's
Data Parallelism on the SGI's
Task Parallelism
Task Parallelism
Limits on Parallel Speedup
Parallel Overhead
Parallel Overhead
Reducing Parallel Overhead
Reducing Parallel Overhead
Improving Load Balance
Improving Load Balance
Improving Load Balance
Improving Load Balance
Additional Material
HTTP References
References