Performance Optimization for the Origin 2000

9/9/98




Table of Contents

Performance Optimization for the Origin 2000

Outline

Performance

Performance Examples

What is Optimization?

Types of Optimization

Steps of Optimization

Performance Strategies

The 80/20 Rule

How high is up?

Don't Sweat the Small Stuff

Considerations when Optimizing

Origin 2000 Architecture

Origin 2000 Architecture

R10000 Architecture

R10000 Architecture

Cache Architecture

Caches exploit Locality

Cache Benchmark

Cache Performance

Cache Mapping

2-way Set Associative Cache

What is a TLB?

O2K Memory Hierarchy

Origin 2000 Access Times

Performance Metrics

What about MFLOPS?

What do we use for evaluation?

Performance Metrics

Fallacies

Basis for Performance Analysis

Asymptotic Analysis

Amdahl's Law

Amdahl's Law

Amdahl's Law

Efficiency

Issues in Performance

Issues in Performance

Problem Size and Precision

Parallel Performance Issues

Understanding Compilers

Compiler Technology

Recommended Flags

Accuracy Considerations

Compiler flags

Roundoff example

Exceptions

Exception profiling

Aliasing

Advanced Aliasing

Advanced Aliasing

Software Pipelining

Interprocedural Analysis

IPA features

Inlining

Manual Inlining

Loop Nest Optimizer

LNO functionality

Optimized Arithmetic Libraries

Numerical Libraries

CHALLENGEcomplib and SCSL

LAPACK

ScaLAPACK

PETSc

O2K Performance Tools

External Timers

External Timers

Internal Timers

Internal Timers

Hardware Performance Counters

Some Hardware Counter Events

Hardware Performance Counter Access

Origin Counter API

Perfex usage

Perfex features

SpeedShop

SpeedShop Components

SpeedShop Usage

SpeedShop Sampling

SpeedShop Sampling

SpeedShop Counting

Ideal Experiment

Prof Usage

ideal Experiment Example

pcsamp Experiment Example

usertime Experiment Example

Gprof Usage

Gprof information

Exception Profiling

Address Space Profiling

dprof

Parallel Profiling

Parallel Profiling

CASEVision Debugger

Outline

Guidelines for Performance

Array Optimization

Memory Access

Array Allocation

Array Referencing

Array Initialization

Array Initialization

Array Padding

Array Padding
a = a + b * c

Stride Minimization

Stride Minimization

Stride Minimization

Loop Fusion

Loop Fusion

Loop Fusion

Loop Interchange

Loop Interchange

Loop Interchange

Floating IF's

Floating IF's

Floating IF's

Loop Defactorization

Gather-Scatter Optimization

Gather-Scatter Optimization

IF Statements in Loops

Loop Defactorization

Loop Defactorization

Loop Defactorization

Loop Peeling

Loop Peeling

Loop Peeling

Loop Collapse

Loop Collapse

Loop Collapse

Loop Collapse

Loop Unrolling

Loop Unrolling

Loop Unrolling

Loop Unrolling

Loop Unrolling and Sum Reductions

Loop Unrolling and Sum Reductions

Loop Unrolling and Sum Reductions

Loop Unrolling and Sum Reductions

Outer Loop Unrolling

Outer Loop Unrolling

Outer Loop Unrolling

Outer Loop Unrolling

Cache Blocking

Cache Blocking

Loop structure

Strength Reduction

Strength Reduction
Horner's Rule

Strength Reduction
Horner's Rule

Strength Reduction
Integer Division by a Power of 2

Strength Reduction
Integer Division by a Power of 2

Strength Reduction
Integer Division by a Power of 2

Strength Reduction
Factorization

Strength Reduction
Factorization

Strength Reduction
Factorization

Subexpression Elimination
Parentheses

Subexpression Elimination
Parentheses

Subexpression Elimination
Type Considerations

Subexpression Elimination
Type Considerations

F90 Considerations

C/C++ Considerations

dplace Usage

Parallel Optimization

Choosing a Data Distribution

Possible Data Layouts

Two-dimensional Block-Cyclic Distribution

Load Balancing

MPP Optimization

Parallel Performance

Message Passing APIs

Message Passing APIs

Message Passing Interface

Message Passing

Message Passing

Communication Issues

Communication Issues

Communication Issues

Message Passing

Message Passing

MPI Message Passing

MPI Message Passing

MPI Message Passing

MPI Optimizations

MPI Data Types

MPI Collective Communication

MPI Collective Communication

Message Passing Optimizations

Message Passing Optimization
Nearest Neighbor Example 1

Message Passing Optimization
Nearest Neighbor Example 2

Message Passing Optimization
Nearest Neighbor Example

MPI Message Passing

Automatic Parallelization

Automatic Parallelization

Data Parallelism

Data Parallelism on the SGI's

Data Parallelism on the SGI's

Data Parallelism on the SGI's

Task Parallelism

Task Parallelism

Limits on Parallel Speedup

Parallel Overhead

Parallel Overhead

Reducing Parallel Overhead

Reducing Parallel Overhead

Improving Load Balance

Improving Load Balance

Improving Load Balance

Improving Load Balance

Additional Material

HTTP References

References

Author: Philip J. Mucci

Email: mucci@cs.utk.edu

Home Page: http://www.cs.utk.edu/~mucci

Author: Kevin S. London

Email: london@cs.utk.edu

Home Page: http://www.cs.utk.edu/~london
