LAPACK for Clusters (LFC)
Kenneth J. Roche
The University of Tennessee,
Department of Computer Science
roche@cs.utk.edu, lfc@cs.utk.edu http://www.cs.utk.edu/~roche
PTOOLS, 11 September 2002; Knoxville, Tennessee
“If I have seen a little farther than others it is because
I have stood on the shoulders of giants.”
-Sir Isaac Newton (1642-1727)
(E.T. Bell, Men of Mathematics -1937)
Revolutionary software –no joke! (fine, at least really useful)
Serial, LINPACK and EISPACK restructured to yield LAPACK
-freely available software which addresses mathematical
problems from linear algebra (Ax=b, Ax=sx, …)
-highly tuned for accuracy and efficiency
-built on the strength of the BLAS (Basic Linear Algebra
Subprograms) routines (C ← sAB + tC, …)
-later largely rebuilt exploiting recursion and modern memory
hierarchies (ATLAS)
[Figure: matrix-matrix multiplication, O(n*n*n); ~3/4 of peak out
of the box with ATLAS; the basis for block-based algorithms in linear algebra]
[Figure: Gaussian elimination, O(n*n*n); ~2/3 of peak out of the box with
LAPACK (ATLAS); a recursive, blocked algorithm rich in BLAS 3]
An aside -on time
Semiconductors (Si, Ge): insulators at low T, conductors when
excited by a current. Valence electrons are involved in covalent
bonding with four nearest neighbors; when excited there is a
non-zero probability that electrons can make the transition to
a conduction band (reference parking garage)
pure - valence band holes and conduction band electrons
are equal in number
impure - doping is the process of generating excess electrons
or holes by adding an impurity atom to a pure semiconductor (donor/acceptor);
sites are defined by the Fermi level and energy gap of the material
(n, Arsenic (Z=33) with Germanium (Z=32); or p, Gallium (Z=31))
p-n junction diodes, the basis for the integrated circuit chip: when
connected to an external circuit, the potential difference across the junction
can be varied, imparting a forward bias in the current of the circuit from
p to n
Nano-scale technology and physical limits ??
Images of the first transistor and a research scientist at IBM
in the design phase of ICCs (Integrated Circuit Chips)
First transistor, Bardeen, Brattain (Shockley, Bell Labs ~1947),
p-n-p type junction settings, power amplifiers
Perspective …
mks unit of time, the second, s
assume c~3.e8 m/s is the fastest we can send and receive
signals from any distance –don’t worry about special relativity
or thermal (magnetoresistance) properties for now … also
no quantum effects –naive!
Anyway, consider the length scale of modern devices, ~100nm
Will software clocks be able to measure time on the scale
T ~ 100nm*(1.e-9m/1nm) / 3.e8m/s ~ 3.333..E-16 seconds ??
(PAPI?)
[Figure: wall-clock and CPU timer measurements. The resolution of each of
the timers is acceptable for BLAS3 for problem sizes > 250 here …]
User_Code() {
Generate_Data(data(out));
Numerical_Application_Routine(data(in/out),
routine_parameters(in));
Use_Solution(); /* Other … */
Clean_Up(); /* free memory, close files, etc. */
Exit();
}
Complications arise as user problems become more
demanding on physical storage requirements or
computing expense (flops)
Parallel, redesign LAPACK for modern distributed computing
environments, ScaLAPACK!
-freely available, highly tuned for accuracy and efficiency
-built on the strength of the PBLAS (Parallel Basic Linear
Algebra Subprograms) routines
-utilizes a general data decomposition scheme which reduces
load imbalance and data movement during computations
-communications are built on the BLACS (Basic Linear Algebra
Communications Subprograms) (built on the MPI standard)
However, …
More complicated code structures accompany the increased
computing power enabled by the ScaLAPACK approach,
-MPI environment initialization and clean-up
-BLACS logical process grid initialization and clean-up
-2d block cyclic data distribution of matrix and vector elements
More complicated compilations as well,
-assumes the proper installation of the BLACS, PBLAS,
ScaLAPACK, LAPACK, BLAS (ATLAS), … libraries
-assumes these libraries are linked properly at compile time
(and built properly at the onset)
Example complication for users:
natural data → 2d block cyclically mapped data
2d block cyclically mapped data → natural data
Natural data example: column-major ordering
On paper,
A = ( A(1,1) A(1,2) A(1,3) … A(1,n)
      A(2,1) A(2,2) A(2,3) … A(2,n)
      …
      A(m,1) A(m,2) A(m,3) … A(m,n) )
Storage requirements,
m*n*sizeof(double) bytes
In memory,
A(1,1) A(2,1) … A(m,1) A(1,2) A(2,2) … A(m,n-1) A(1,n) A(2,n) … A(m,n)
Although subroutines exist to assist users in mapping
their natural data into the expected 2d block cyclic
decomposition, experience suggests many find
it confusing to initialize array descriptors and call the
appropriate routines
The problem here is …
-This is certainly a problem which we would like to address.
Intermission …
On LFC …
Life with LFC
Beowulf Clusters: (http://www.beowulf.org , http://www.ieeetfcc.org )
history: 1994, Thomas Sterling and Don Becker, The Center of Excellence in
Space Data and Information Sciences (CESDIS) and the Earth and Space
Sciences (ESS), 16 DX4 processors connected by channel-bonded Ethernet
(CbEthernet; network traffic is striped across 2 or more Ethernets, since switches
were expensive and processors too fast for a single Ethernet)
today: emergence of COTS (commodity off the shelf) systems which are
assembled sub-systems of micro-processors, motherboards, disks, and network
interface cards, channel based communications replaced by switches in the
cluster computing environment -faster switches and communication lines continue
to emerge
taxonomy: somewhere between MPPs (massively parallel processors) and NOWs
(in particular, nodes are dedicated to the cluster, the network is isolated from the
external network, and the interconnection network for the cluster is invisible to the
outside world in a BC -not so in NOWs … the claim is that any parallel program
which runs on a NOW will run at least as well on a BC)
Beowulf Clusters vs. High Performance Vendor Supplied Systems:
Coupling vendor hardware and software forces system software to be perpetually
immature -the result is that application software is in a constant state of development
for these systems. The goal for BCs is a drive towards some standard programming
model in which programmers can rely upon their software running regardless of the
maker of the underlying hardware -do programming languages help here? Either
way, the HPC community moans a great deal about inadequate software.
BCs are cheaper to build, allow the architect more options, and should evolve into
balanced systems more quickly because many groups are now sharing their
experiences in these efforts (as opposed to proprietary vendor-related issues). On
the flip side, removing the vendor imparts real burdens on system administrators and
software developers. Vendors I have dealt with work closely with the application
developers and provide good support for hardware and design issues -this frees
up a lot of time to worry about other things -and gives users/application developers
someone other than themselves to blame for yucky outcomes of efforts. :)
Vendor experience with complex systems will be critical to accelerating the effort
to balance and build the scalable computing systems of the future. Sharing is good.
LFC sample computing environment:
[Diagram: cluster nodes on a fully connected ~Mbit switch and a fully connected
~Gbit switch; remote memory server, e.g. IBP (TCP/IP); local network file server,
SUN's NFS (UDP/IP) over e.g. 100 Mbit; users, etc.]
LFC preferred computing environment:
[Diagram: same components -fully connected ~Mbit and ~Gbit switches; remote
memory server, e.g. IBP (TCP/IP); local network file server, SUN's NFS (UDP/IP)
over e.g. 100 Mbit; users, etc.]
User creates data
-incore
-out-of-core
User calls LFC routine
User uses solution or
whatever
User exits
Checks that the daemon is active
Gathers relevant cluster data
-active processors
-cpu, memory, network io,
cluster effective bandwidth (ebw) and latency, blas3
Performance modeler for specific
problem (resource selection)
Dresses command line, forks, execs,
and waits
Cleans up and returns control to user
Initializes the parallel environment
Gets data incore with the proper 2d distribution
Invokes the parallel application kernel
Backward maps the solution
Releases the parallel environment
Clean exit
(phases: serial → parallel → serial)
[Diagram: LFC_daemon probe cycle, from start -processor_discovery();
memory(); cpu_loads(); communications(); blas_3(); i/o()]
LFC_daemon memory investigations:
Cache and Unpaged Physical
Data collected by the
daemon forms a
stochastic time series in
each of the observables.
Outstanding questions regarding the data collected …
Should we stockpile time dependent data for each observable?
If so, how shall said data be analyzed?
-ARMAs, Martingales, some physical heuristic –e.g.
Brownian motion or some other Wiener process,
spectral measure theory, Markov chains, …
If not, is a snap-shot of the cluster state sufficient for scheduling
and resource selection?
Sample communications data on the cluster:
Broadcast and Point-to-Point
Sample input/output data on the cluster:
[Figure: network-based reads and writes; curves for "no return of data" and
"CPU waits on data after sending address"]
LFC_daemon: sample input/output data on the cluster
[Figure: network-based reads and writes -alternative storage scheme (IBP)]
Serial interface to the application routines mimics the calls
of LAPACK -also, utility routines are provided which assist
the user in preparing their data if this is necessary:
/* data is in-core, retain the same interface as LAPACK in this case, returns -1 on error */
int LFC_dgesv(int n, int nrhs, double *a, int lda, int *ipiv, double *b, int ldb, int info);
/* data is out-of-core, returns -1 on error, expects paths to the data stored on disk */
int LFC_dgesv_file(int n, int nrhs, char *a, int lda, char *ipiv, char *b, int ldb, int info);
%gcc -Wall user.c -L/…/LFC -llfc …
Checks that the daemon is active and if not reactivates the
daemon and uses snap-shot data for the performance modeler
Retrieves the cluster state and passes this information to the
modeler/resource selector
The resource selector attempts to minimize a multi-variable
time function –methods such as an exhaustive search blow up
quickly as the number of candidate processors grows
-annealing, ad hoc evaluation (GrADS), genetic algorithms,
dynamic linear programming (convex-hull sets) are all
possible methods (others…)
Provides expert tuning for the selected resources based on
the specific problem parameters -e.g. grid aspect ratio and block size
Creates the command line, forks, execs, waits
Sample block size optimization by the middleware …
[Figure: block size optimization, Nprocs = 8]
Expert tuning by the middleware:
[Figure: grid aspect ratio for parallel runs. Poor choice of grid aspect ratio
can lead to unnecessarily large time delays …]
Attempting to minimize a multi-parameter space which is changing
in time … grrr
Bigger … not always better
[Diagram: blocked LU partition of A -already-factored L0, U0, current panels
A1,1, A2,1, A1,2, and trailing matrix A2,2]
The current column of processors performs
LU on an m x nb panel of A (A11, A21); the
result is L11, L21, and U11 (factor phase)
Find the pivot, perform the swap, scale, and
broadcast the corresponding row column-wise
Each process in the active process column broadcasts the same
pivot information row-wise to all columns of processors
Each process applies row interchanges to the left and right of
the current panel, and L11 is broadcast along the current row of
processes, converting A12 into U12 (pdtrsm, U12 := (L11)^-1 A12)
L21 is broadcast row-wise across all columns of processes. The row
panel U12 is broadcast column-wise down all process rows. All
processes update their local portion of A22
(pdgemm, A22 := A22 - L21 U12)
T = t_factor + t_broadcast + t_update
Sample of the predictive power of the ad hoc modeler:
(application kernel only)
Stand-alone rendition of relevant ScaLAPACK and BLACS
routines for supported application kernels -LU, Cholesky, QR,
symmetric and general eigenvalue problems
After logical process grid is defined, the user’s data must be
brought in-core
Once the data is in place, the parallel application routine is
invoked
On return, the solution is backward mapped and the parallel
environment released
Data movement implementations
Use fseek() and fread() to read data from NFS (lacks scalability)
Use the ScaLAPACK redistribution routine (memory problem for big matrices)
Use the darray data type and collective I/O in MPI-2 (scalable; a bug in MPICH)
Performance
of fread():
ROMIO performance:
Comparison of the methods:
Do experiments to collect performance data
Estimate parameters using statistical methods
Do run-time schedule based on performance models
Modeling the data movement –prediction?
Physicists from Rice recently (1 May) reported the observation of
experimentally crafted atomic solitons for the first time (9 May, Nature)
Solitons are localized wave packets that maintain a constant
shape as they propagate -what about small distances?
They may be used in ultra high speed optical communication
networks since they can carry data over great distances
without signal boosters
Reference BEC (cooling + magnetic confinement of atoms),
self-attraction balances the tendency for dispersal of the
wave packets
LFC methodology and application kernel -sample result:
I did tend to work alone doing work primarily in the evening when
I wouldn't be interrupted by other people. That lends itself to computer
design because it's hard to do it as a group activity.
Seymour Cray
Interesting quote …
Problems …
oversubscribed nodes at the onset …
jobs running past I/O or synchronization points …
bug in ROMIO darray identified and patched …
[Figure: HPL -poor choice of grid aspect ratio can lead to unnecessarily
large time delays …]