April 15th-17th, 2019 | Knoxville, TN
The workshop gathers leading researchers in high-performance computing from the JLESC partners INRIA, the University of Illinois, Argonne National Laboratory, Barcelona Supercomputing Center, Jülich Supercomputing Centre, RIKEN R-CCS and The University of Tennessee to explore the most recent and critical issues in advancing the field of HPC from petascale to the extreme scale era.
The workshop will feature sessions organized around seven central topics.
In addition, dedicated sessions on computational fluid dynamics, computational biology and climate/weather research are planned.
A key objective of the workshop is to identify new research collaborations and establish a roadmap for their implementation.
The workshop is open to Illinois, INRIA, ANL, BSC, JSC, RIKEN R-CCS and UTK faculty, researchers, engineers and students who want to learn more about post-petascale / pre-exascale computing.
Day 1 (Monday, April 15)

| Time | Track 1 | Track 2 | Track 3 |
|---|---|---|---|
| 07:30 | Registration & Breakfast | | |
| 08:30 | Opening: Robert Speck, Yves Robert, Jack Dongarra and George Bosilca (UTK) | | |
| 09:00 | Plenary: Delivering on the Exascale Computing Project Mission for the U.S. Department of Energy, Douglas B. Kothe (ORNL). Session chair: Jack Dongarra | | |
| 10:00 | Break | | |
| 10:30 | Session chair: Atsushi Hori | Session chair: Jakub Kurzak | Session chair: Michela Taufer |
| 12:00 | Lunch | | |
| 13:30 | Session moved to Wednesday (PTM3.1) | Session chair: Robert Speck | Session organizer: Piotr Luszczek |
| 15:00 | Break | | |
| 15:30 | Session chair: Daniel Katz | Session chair: Gabriel Antoniu | Session chair: Stan Tomov |
| 17:30 | Adjourn | | |
| 19:00 | Social event: dinner at Calhoun's on the River (upstairs) | | |
Day 2 (Tuesday, April 16)

| Time | Track 1 | Track 2 | Track 3 |
|---|---|---|---|
| 07:30 | Breakfast | | |
| 08:30 | Plenary: Not Your Grandfather’s Tractor – How AI & IoT Are Transforming Production Agriculture, Mark Moran (John Deere). Session chair: Brendan McGinty | | |
| 09:30 | Break | | |
| 10:00 | Session chair: Hartwig Anzt | Session chair: Rosa Badia | ARM session. Session organizer: Mitsuhisa Sato |
| 12:00 | Lunch | | |
| 13:30 | Session chair: Christian Perez | ARM session. Session organizer: Mitsuhisa Sato | |
| 14:00 | Session organizer: Kazutomo (Kaz) Yoshii | | |
| 16:00 | Poster sessions @ Hilton Hotel, Hiwassee Room (lobby level) | | |
| 17:30 | Social event: Riverboat Dinner and Celebration (meet at the dock at 17:50; the boat leaves at 18:00 sharp) | | |
Day 3 (Wednesday, April 17)

| Time | Track 1 | Track 2 | Track 3 |
|---|---|---|---|
| 07:30 | Breakfast | | |
| 08:30 | Plenary: Programming workflows for Advanced Cyberinfrastructure Platforms, Rosa Badia (BSC). Session chair: Franck Cappello | | |
| 09:30 | Break | | |
| 10:00 | Session chair: Ruth Schoebel | Session chair: Bogdan Nicolae | Session organizer: Laxmikant (Sanjay) Kale |
| 12:00 | Lunch | | |
| 13:30 | Session chair: Jon Calhoun | Session chair: Ana Gainaru | |
| 15:00 | Closing | | |
| 19:00 | Closing dinner: Lonesome Dove | | |
Plenary talks

Not Your Grandfather’s Tractor – How AI & IoT Are Transforming Production Agriculture
Mark Moran (John Deere)
Abstract: After some background on the sweeping changes happening in agricultural technology as a result of General Purpose Technologies, Mark will discuss how AI and IoT are merging on the farm, creating some compelling problems to be solved on the edge.

Programming workflows for Advanced Cyberinfrastructure Platforms
Rosa M. Badia (Barcelona Supercomputing Center)
Abstract: In the design of an advanced cyberinfrastructure platform (ACP) that involves sensors, edge devices, instruments, computing power in the cloud and HPC, a key aspect is how to describe the applications to be executed on such a platform. Very often these applications are not standalone, but involve a set of sub-applications or steps composing a workflow. The scientists then rely on effective environments to describe their workflows and on engines to manage them in complex infrastructures. COMPSs is a task-based programming model that enables the development of workflows that can be executed in parallel on distributed computing platforms. The workflows that we currently support may involve different types of tasks, such as parallel simulations (MPI) or analytics (i.e., written in Python thanks to PyCOMPSs, the Python binding for COMPSs). COMPSs, through a storage interface, makes transparent the access to persistent data stored in key-value databases (Hecuba) or object-oriented distributed storage environments (dataClay). While COMPSs has been developed from early on for highly distributed environments, we have been extending it to deal with more challenging environments, with edge devices and components in the fog that can appear and disappear.

Delivering on the Exascale Computing Project Mission for the U.S. Department of Energy
Douglas B. (Doug) Kothe (Oak Ridge National Laboratory)
Abstract: The vision of the U.S. Department of Energy (DOE) Exascale Computing Project (ECP), initiated in 2016 as a formal DOE project executing through 2023, is to accelerate innovation with exascale simulation and data science solutions that enhance U.S. economic competitiveness, strengthen our national security, and change our quality of life. ECP’s mission is to deliver exascale-ready applications and solutions that address currently intractable problems of strategic importance and national interest; create and deploy an expanded and vertically integrated software stack on DOE HPC exascale and pre-exascale systems, thereby defining the enduring US exascale ecosystem; and leverage U.S. HPC vendor research activities and products into DOE HPC exascale systems. The project is a joint effort of two DOE programs: the Office of Science Advanced Scientific Computing Research Program and the National Nuclear Security Administration Advanced Simulation and Computing Program. ECP’s RD&D activities are carried out by over 100 teams of scientists and engineers from the DOE national laboratories, universities, and U.S. industry. These teams have been working together since the fall of 2016 on the development of applications, software technologies, and hardware technologies and architectures:
* Applications: creating or enhancing the predictive capability of applications through algorithmic and software advances via co-design centers; targeted development of requirements-based models, algorithms, and methods; systematic improvement of exascale system readiness and utilization; and demonstration and assessment of effective software integration.
* Software Technologies: developing and delivering a vertically integrated software stack containing advanced mathematical libraries, extreme-scale programming environments, development tools, visualization libraries, and the software infrastructure to support large-scale data management and data science for science and security applications.
* Hardware and Integration: supporting U.S. HPC vendor R&D focused on innovative architectures for competitive exascale system designs; objectively evaluating hardware designs; deploying an integrated and continuously tested exascale software ecosystem at DOE HPC facilities; accelerating application readiness on targeted exascale architectures; and training on key ECP technologies to accelerate the software development cycle and optimize productivity of application and software developers.
Illustrative examples will be given on how the ECP teams are delivering in these three areas of technical focus, with specific emphasis on recent work on the world’s #1 supercomputer, namely the Summit leadership system at Oak Ridge National Laboratory.
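To make the task-based workflow model described in the COMPSs abstract above more concrete, here is a minimal PyCOMPSs-style sketch. It assumes a working COMPSs installation and uses the documented @task decorator and compss_wait_on call; the function names and the toy workflow itself are illustrative, not taken from the talk.

```python
# Minimal PyCOMPSs sketch: each decorated function becomes a task whose
# executions are scheduled by the COMPSs runtime; data dependencies between
# tasks are inferred from their parameters and return values.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=1)
def simulate(block_id):
    # Stand-in for an expensive simulation step (e.g., an MPI task in COMPSs).
    return [block_id * i for i in range(1000)]


@task(returns=1)
def analyze(samples):
    # Stand-in for an analytics step written in plain Python.
    return sum(samples) / len(samples)


if __name__ == "__main__":
    # Tasks are submitted asynchronously; future objects stand in for results.
    partial = [analyze(simulate(b)) for b in range(8)]
    # Synchronize only when the actual values are needed.
    results = compss_wait_on(partial)
    print("per-block averages:", results)
```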
Mark Moran is the Director of the John Deere Technology Innovation Center in Research Park at the University of Illinois, where he leads a team focused on digital innovation including UX, AI/ML, IoT, Mobile, Cloud and Autonomy. Mark has a BS in Engineering from the University of Illinois, an MBA from the University of Iowa, and an SM in System Design and Management from MIT. He is currently pursuing his PhD at the University of Illinois, focusing on Cognitive Science and Natural Language Processing. Mark and his wife Lori have two college-aged daughters and four dogs.
Rosa M. Badia holds a PhD in Computer Science (1994) from the Technical University of Catalonia (UPC).
She is the manager of the Workflows and Distributed Computing research group at the Barcelona Supercomputing Center (BSC).
Her current research interests are programming models for complex platforms (from edge and fog to clouds and large HPC systems). The group led by Dr. Badia has been developing the StarSs programming model for more than 10 years, with high success in adoption by application developers. Currently the group focuses its efforts on PyCOMPSs/COMPSs, an instance of the programming model for distributed computing including the Cloud. The group is extending the model to be able to consider edge devices that offload computing to the fog and to the cloud.
Dr. Badia has published nearly 200 papers in international conferences and journals on the topics of her research. Her group is very active in projects funded by the European Commission and in contracts with industry.
Douglas B. Kothe (Doug) has over three decades of experience in conducting and leading applied R&D in computational science applications designed to simulate complex physical phenomena in the energy, defense, and manufacturing sectors. Doug is currently the Director of the U.S. Department of Energy (DOE) Exascale Computing Project (ECP). Prior to that, he was Deputy Associate Laboratory Director of the Computing and Computational Sciences Directorate (CCSD) at Oak Ridge National Laboratory (ORNL). Other prior positions for Doug at ORNL, where he has been since 2006, include Director of the Consortium for Advanced Simulation of Light Water Reactors, DOE’s first Energy Innovation Hub (2010-2015), and Director of Science at the National Center for Computational Sciences (2006-2010).
Before coming to ORNL, Doug spent 20 years at Los Alamos National Laboratory, where he held a number of technical and line and program management positions, with a common theme being the development and application of modeling and simulation technologies targeting multi-physics phenomena characterized in part by the presence of compressible or incompressible interfacial fluid flow. Doug also spent one year at Lawrence Livermore National Laboratory in the late 1980s as a physicist in defense sciences.
Doug holds a Bachelor of Science in Chemical Engineering from the University of Missouri – Columbia (1983) and a Master of Science (1986) and Doctor of Philosophy (1987) in Nuclear Engineering from Purdue University.
Project: Optimization of Fault-Tolerance Strategies for Workflow Applications
This work compares the performance of different approaches to tolerate failures for applications executing on large-scale failure-prone platforms.
Project: Effective Use of Lossy Compression for Numerical Linear Algebra Resilience and Performance
Fault-tolerance is one of the major challenges of extreme-scale computing. There is broad consensus that future leadership-class machines will exhibit a substantially reduced mean time between failures (MTBF) compared to today’s systems, because of the expected increase in the number of components without a commensurate improvement in per-component reliability. The resilience challenge at scale is best summarized as: faults and failures are likely to become the norm rather than the exception, and any simulation run will be compromised without the inclusion of resilience techniques in the underlying software stack and system. Therefore we investigate recovery approaches for partially lost data in iterative solvers using compressed checkpoints and data reconstruction techniques.
Multigrid methods are optimal iterative solvers for elliptic problems and are widely used as preconditioners or even standalone solvers. Multigrid methods provide the opportunity to use the underlying hierarchy to create compressed checkpoints in an intuitive way by restricting the iterative solution to coarser levels. A second lossy compression technique uses SZ, a specially designed floating-point lossy data compressor, which has been shown to improve classical checkpointing approaches for the recovery of time-marching systems. SZ compression, unlike multigrid compression, allows users to prescribe accuracy targets and is thereby more easily adaptable to the needs of the iterative solver. Using these two compression techniques, we evaluate their usability for restoring partially lost data of the iterative approximation during a linear solve, e.g. due to node losses. We compare different recovery approaches, from simple value-wise replacement without post-processing up to solving local auxiliary problems, and their efficiency. For the efficiency evaluation, we focus mainly on compression rate, numerical overhead and necessary communication. Furthermore, we investigate the checkpoint frequency. Preliminary results show that there is no "Swiss Army knife" that is optimal in all cases. Therefore, it is inevitable to adapt the methods with respect to the problem and the state of the iterative solver.
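As a toy illustration of recovering partially lost solver state from a compressed checkpoint, the sketch below runs a Jacobi iteration, keeps a lossy checkpoint (here simply a float32 copy standing in for an SZ- or multigrid-compressed one), "loses" part of the iterate, and restores the lost entries by value-wise replacement before continuing. All names and the compression stand-in are illustrative assumptions, not the implementation discussed in the talk.

```python
import numpy as np

# 1D Poisson-like system solved with Jacobi; the checkpoint is a reduced
# precision copy of the iterate, standing in for a lossy-compressed one.
n = 200
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
D_inv = 1.0 / np.diag(A)
R = A - np.diag(np.diag(A))

def jacobi(x, steps):
    for _ in range(steps):
        x = D_inv * (b - R @ x)
    return x

x = np.zeros(n)
x = jacobi(x, 200)
checkpoint = x.astype(np.float32)          # "lossy compression" stand-in

x = jacobi(x, 100)                          # keep iterating after the checkpoint
lost = slice(50, 100)                       # simulate a node loss
x[lost] = 0.0

# Value-wise replacement: restore the lost entries from the lossy checkpoint.
x[lost] = checkpoint[lost].astype(np.float64)
x = jacobi(x, 200)                          # resume the solve
print("residual after recovery:", np.linalg.norm(b - A @ x))
```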
Project: Checkpoint/Restart of/from lossy state
Lossy compression algorithms are effective tools to reduce the size of HPC datasets. As established lossy compressors such as SZ and ZFP evolve, they seek to improve the compression/decompression bandwidth and the compression ratio. As the underlying algorithms of these compressors evolve, the spatial distribution of errors in the compressed data changes, even for the same error bound and error bound type. Recent work has shown that properties of the simulation such as the choice of boundary conditions, PDE properties, and domain geometry significantly impact an application's ability to compute on state from lossy compressed data files. If HPC applications are to compute on data coming from compressed data files, we require an understanding of how the spatial distribution of error changes. This talk explores how the spatial distribution of error, compression/decompression bandwidth, and compression ratio change for HPC data sets from the applications PlasComCM and Nek5000 between various versions of SZ and ZFP. In addition, we explore how the spatial distribution of error impacts the correctness of the applications and the ability to create methodologies to recommend lossy compression error bounds.
Project: Checkpoint/Restart of/from lossy state
Krylov methods are among the most widely used iterative methods for solving large-scale linear problems. Pipelined variations, which overlap communication and computation, have also been developed in order to handle exascale problems on HPC hardware. A downside of these methods, however, is the large amount of memory required to, among other things, store a basis for the Krylov subspace.
Recently, interest has grown in the use of lossy compression techniques to reduce the I/O footprint of large-scale computations when, for example, out-of-core calculations or checkpointing are used. These lossy compression techniques allow for much higher compression rates than lossless techniques, while at the same time maintaining precise control over the error introduced by the compression algorithm.
On the other hand, since it has been shown that Krylov methods allow for some inexactness in the matrix-vector product, it might be possible to combine them with these lossy compression techniques. In this talk we will explore this idea and show some preliminary results.
This is joint work with Emmanuel Agullo (Inria), Luc Giraud (Inria) and Franck Cappello (ANL).
References:
1. L. Giraud, S. Gratton, and J. Langou. Convergence in backward error of relaxed GMRES. SIAM Journal on Scientific Computing, 29(2):710–728, 2007.
2. V. Simoncini and D. B. Szyld. Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM Journal on Scientific Computing, 25:454–477, 2003.
3. D. Tao, S. Di, and F. Cappello. A Novel Algorithm for Significantly Improving Lossy Compression of Scientific Data Sets. International Parallel and Distributed Processing Symposium (IEEE/ACM IPDPS), 2017.
4. J. van den Eshof and G. L. G. Sleijpen. Inexact Krylov subspace methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 26(1):125–153, 2004.
Project: HPC libraries for solving dense symmetric eigenvalue problems
One of the main purposes of this project is to provide the JLESC community with an overview of the pros and cons of existing software, and a perspective on better adaptation to present and planned computers and accelerators. In this talk, we review GPU eigensolvers and the performance benchmark results of four kernels (cuSolverDnDsyevd, magma_dsyevd, magma_dsyevdx_2stage, and eigen) accelerated by a single Titan V (GV100) GPU. Also, some preliminary progress reports of ELPA on a GPU cluster will be presented.
Project: Scalability Enhancements to FMM for Molecular Dynamics Simulations
In order to use today's heterogeneous hardware efficiently, both intra-node and inter-node parallelization are required.
Application developers can choose between fully integrated PGAS approaches hiding the distributed layout of the memory and MPI+X approaches separating shared- and distributed-memory communication.
The former is perfectly suited if the communication overhead of the transferred data is small compared to the associated computational effort.
If more control over the message size and message timing is required, the latter approach provides better tuning possibilities.
However, using message passing directly in the application clutters up the code and mixes the often non-trivial communication algorithm with the application algorithm.
In this talk, we explore how to extend our tasking framework, which is used for a synchronization-critical Fast Multipole Method (FMM), to support distributed memory communication.
The tasking framework is based on C++11 and handles the efficient execution of different computational tasks arising in an FMM workflow.
Tasks are configured at compile-time via template meta-programming and mapped to ready-to-execute queues once all dependencies are met.
We show how this concept can be extended to communication tasks and provide preliminary results.
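The framework itself is C++11 with template meta-programming, but the underlying scheduling idea (tasks become ready once their dependencies are met and are then moved to an execution queue) can be sketched language-agnostically. The snippet below is a simplified Python analogue under our own assumptions, not the actual FMM tasking framework.

```python
from collections import deque

class Task:
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, set(deps)

def run(tasks):
    # Tasks whose dependency sets are satisfied are "ready" and moved to a
    # queue; completing a task may release its successors, mimicking the
    # dependency handling of the FMM tasking framework (communication tasks
    # would simply be another task type in this scheme).
    pending = {t.name: t for t in tasks}
    done = set()
    ready = deque(t for t in tasks if not t.deps)
    while ready:
        t = ready.popleft()
        t.fn()
        done.add(t.name)
        del pending[t.name]
        for other in pending.values():
            if other.deps <= done and other not in ready:
                ready.append(other)

run([
    Task("P2M", lambda: print("particle-to-multipole")),
    Task("M2M", lambda: print("multipole-to-multipole"), deps=["P2M"]),
    Task("M2L", lambda: print("multipole-to-local"),     deps=["M2M"]),
    Task("L2P", lambda: print("local-to-particle"),      deps=["M2L"]),
])
```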
Project: Fast Integrators for Scalable Quantum Molecular Dynamics
Accurate modeling of excited electron dynamics in real materials via real-time time-dependent density functional theory (RT-TDDFT) relies on efficient integration of the time-dependent Kohn-Sham equations, a large system of non-linear PDEs. Traditional time-stepping algorithms like fourth-order Runge-Kutta require extremely short time steps to control error, leading to high computational cost and limiting the feasible system size and simulation time. We have accelerated the search for more efficient numerical integration schemes by interfacing the massively parallel Qbox/Qb@ll RT-TDDFT code with the PETSc library. In this talk, we compare the accuracy, stability, and overall merit of various time-steppers available within PETSc for RT-TDDFT problems, and we investigate the potential of adaptive time stepping to reduce time to solution.
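The actual comparison uses PETSc's time-steppers inside Qbox/Qb@ll, but the stability/accuracy trade-off can be illustrated on a toy Schrödinger-type system: the explicit RK4 step below needs a small time step to stay stable, whereas the implicit Crank–Nicolson step is unconditionally stable and preserves the norm of the state. This is a self-contained NumPy sketch under our own assumptions, not code from the project.

```python
import numpy as np

# Toy analogue of time-dependent Kohn-Sham propagation: i du/dt = H u,
# i.e. du/dt = -1j * H @ u for a small Hermitian H.
rng = np.random.default_rng(0)
M = rng.standard_normal((64, 64))
H = (M + M.T) / 2.0                      # Hermitian (real symmetric) "Hamiltonian"
u0 = rng.standard_normal(64) + 0j
u0 /= np.linalg.norm(u0)

def rhs(u):
    return -1j * (H @ u)

def rk4_step(u, dt):
    k1 = rhs(u)
    k2 = rhs(u + 0.5 * dt * k1)
    k3 = rhs(u + 0.5 * dt * k2)
    k4 = rhs(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

I = np.eye(64)
def crank_nicolson_step(u, dt):
    # (I + i dt/2 H) u_{n+1} = (I - i dt/2 H) u_n  -- unitary, norm-preserving
    return np.linalg.solve(I + 0.5j * dt * H, (I - 0.5j * dt * H) @ u)

for dt in (0.05, 0.3):
    u_rk4, u_cn = u0.copy(), u0.copy()
    for _ in range(200):
        u_rk4 = rk4_step(u_rk4, dt)
        u_cn = crank_nicolson_step(u_cn, dt)
    print(f"dt={dt}: |u| after RK4 = {np.linalg.norm(u_rk4):.3e}, "
          f"after Crank-Nicolson = {np.linalg.norm(u_cn):.3e}")
```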
Project: The ChASE library for large Hermitian eigenvalue problems
One of the most pressing bottlenecks of the original Jena VASP-BSE code is the solution of a large and dense Hermitian eigenvalue problem. While computing the Hamiltonian matrix itself is computationally expensive, there were two main issues linked to the solution of the eigenproblem: 1) the storage and reading of the matrix and 2) the computation of a small set of extremal eigenpairs. Thanks to an intense and fruitful collaboration we were able to restructure and parallelize the reading of the Hamiltonian matrix, and accelerate the computation of the desired eigenpairs making use of the ChASE library. The effort resulted in a tremendous performance improvement, and a drastic increase in strong scalability and parallel efficiency. In addition, the Jena BSE code can now access new regimes of k-point sampling density that lead to very large Hamiltonians and were previously inaccessible. The new results show how the use of a tailored eigensolver can extend the range of the domain simulation parameters paving the way to the study of new physical phenomena.
Project: Sharing and extension of OmpSs2 runtime for XMP 2.0 PGAS task model
Task parallelism is considered one of the most promising parallel programming paradigms for handling large irregular applications and many-core architectures such as the post-K computer. R-CCS is developing a parallel programming model named XcalableMP (XMP), whose recent specification introduced a task-parallel programming model for distributed memory systems. In this talk, we will present recent results on the XMP 2.0 task model implemented with the OmpSs2 task runtime shared by the Barcelona Supercomputing Center.
Project: Shared Infrastructure for Source Transformation Automatic Differentiation
We discuss the self-adjoint shared memory parallelization (SSMP) and transposed forward-mode algorithmic differentiation (TFMAD) strategies for parallelizing reverse-mode derivative computation. We describe a prototype analysis tool and provide an overview of our plan for incorporating the analyses and transformations into the Tapenade algorithmic differentiation tool.
Project: Reducing Communication in Sparse Iterative and Direct Solvers
The talk studies the integration of on-node threading with iterative methods, in particular how non-blocking techniques perform at various scales. A comparison with threaded MPI is included to study improvements to Krylov subspace methods; the benefits are not spread equally across Krylov variants. Finally, mixing s-step methods with pipelining improves performance while maintaining stability. A quick note on programmer productivity from using threading and/or non-blocking techniques is also presented.
Project: Reducing Communication in Sparse Iterative and Direct Solvers
Enlarged Krylov subspace methods aim to tackle the poor scalability of classical Krylov subspace methods on large supercomputers by reducing the number of iterations to convergence. This reduction comes at the cost of a large increase in the computation required per iteration compared to that of classical Krylov methods. This talk will discuss the design of scalable solvers based on enlarged Krylov subspace methods, including the introduction of on-node parallelism for utilizing emerging supercomputer architectures, as well as preconditioning techniques to further reduce the iteration count.
Project: Developer tools for porting and tuning parallel applications on extreme-scale parallel systems
Developments in the partners' tools and their interoperability will be reported, along with their use with large-scale parallel applications. In particular, early experience with performance analysis of applications using the JURECA cluster+booster modular supercomputing architecture will be presented.
Project: Deep Memory Hierarchies
We study the feasibility of using Intel’s Processor Event-Based Sampling (PEBS) feature to record memory accesses by sampling at runtime, and study the overhead at scale. We have implemented a custom PEBS driver in the IHK/McKernel lightweight multi-kernel operating system, one of whose advantages is minimal system interference thanks to the lightweight kernel’s simple design compared to other OS kernels such as Linux.
Project: Towards accurate network utilization forecasting using portable MPI-level monitoring
In this talk we will present a low-level monitoring system implemented within Open MPI and how we leverage it to forecast network usage of the application.
Project: MPI International Survey
The preliminary result of the MPI International Survey will be presented.
Project: Reconfiguring Distributed Storage Systems on HPC infrastructures
Parallel storage systems are becoming an increasing bottleneck for HPC applications as their performance does not scale as fast as that of computing resources. A potential solution to this challenge is to use transient distributed storage systems that are deployed on the compute nodes only during the application's run time. Rescaling such a distributed storage system is a complex operation that can be beneficial in many situations: to adjust a system to its workload, to have a configuration fitted to each phase of a workflow, or to save core hours or energy. However, rescaling a distributed storage system necessarily involves numerous data transfers that are assumed to be too slow to be used at run time.
In this talk we present Pufferbench, a benchmark designed to estimate how long a rescaling operation would last in practice on a given platform.
Implementing an efficient rescaling mechanism in an actual distributed storage system is a bigger challenge. The load on each storage node should be balanced to avoid performance degradation from hotspots. The load, however, is often not linked to the amount of data stored per node, a metric that should also be balanced in order to have reliable and efficient rescaling operations. Optimizing both is a challenge that must be overcome to implement rescaling operations in distributed storage systems.
In the second part of this talk, we present early results with Pufferscale, a scheduler that organizes data transfers during rescaling operations and ensures fast and reliable rescaling durations while balancing the load.
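As a back-of-the-envelope companion to the discussion above, the sketch below estimates the minimum amount of data that must move when a balanced storage system grows from N to M nodes, and the corresponding lower bound on rescaling time given per-node network bandwidth. This is our own simplified model for illustration, not Pufferbench's actual methodology.

```python
def rescale_lower_bound(total_gib, n_old, n_new, node_bw_gib_s):
    """Lower bound on commission time when growing from n_old to n_new nodes.

    Assumes data is perfectly balanced before and after rescaling, so each
    new node must end up with total/n_new data, all of it shipped from the
    old nodes over the network.
    """
    data_to_move = total_gib * (n_new - n_old) / n_new   # GiB shipped to new nodes
    # The transfer is limited either by the new nodes receiving or by the
    # old nodes sending, whichever aggregate bandwidth is smaller.
    agg_bw = min(n_new - n_old, n_old) * node_bw_gib_s
    return data_to_move, data_to_move / agg_bw

moved, seconds = rescale_lower_bound(total_gib=1024, n_old=16, n_new=24,
                                     node_bw_gib_s=1.25)
print(f"data moved: {moved:.1f} GiB, lower bound on duration: {seconds:.1f} s")
```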
Project: Advancing Chameleon and Grid'5000 testbeds
Chameleon develops and operates a large-scale, deeply reconfigurable platform to support computer science systems research. Resources include a large homogeneous partition comprising 600 Intel Haswell nodes as well as smaller clusters representing architectural diversity, including ARM, Atom, low-power Xeon and Skylake nodes, and hardware such as SDN-enabled programmable switches, InfiniBand interconnects, large RAM, non-volatile memory, different types of SSDs, GPUs, and FPGAs. Chameleon supports bare-metal reconfiguration, including booting from a custom kernel, network stitching, and software-defined networking experiments. To date, Chameleon has served 3,000+ users working on 500+ research and education projects.
Chameleon Phase 2 includes a range of new capabilities: CHI-in-a-box, a packaging of the system allowing others to deploy similar testbeds; a range of new networking capabilities; and, most importantly, new services such as Jupyter integration, new orchestration features, and Experiment Precis, which allow investigators to structure their experiments in a way that allows for easy repetition, recording, and meaningful sharing with the experimental community. This talk will provide an overview of the new services added to Chameleon in the last year, focusing on experimental techniques.
Project: Improving the Performance and Energy Efficiency of HPC Applications Using Autonomic Computing Techniques
The behaviour of HPC systems (performance, power consumption, thermal distribution) is increasingly hard to predict due to process variation and dynamic processors. Additionally, computing facilities are now interested in limiting computing resources by a power/energy budget rather than by CPU time. We propose a software-level approach that provides control mechanisms which augment hardware power-limiting features and work across multiple nodes, applying autonomic computing techniques and targeting HPC workloads. We adopt an approach combining autonomic computing and control theory. We present the identified technical issues, our approach, and first results.
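To illustrate the kind of software-level control loop we have in mind, here is a minimal integral controller that steers a per-node power cap toward a global budget. The read_node_power() and apply_power_cap() helpers are hypothetical placeholders (in practice they would wrap something like RAPL or a vendor interface); the controller logic is a generic sketch, not the project's implementation.

```python
import random
import time

def read_node_power():
    # Hypothetical sensor read (e.g., RAPL energy counters on a real system).
    return random.uniform(80.0, 140.0)          # watts

def apply_power_cap(cap_watts):
    # Hypothetical actuator (e.g., writing a RAPL power limit).
    print(f"  applying cap: {cap_watts:6.1f} W")

def control_loop(budget_watts, steps=10, gain=0.5, period_s=0.0):
    """Simple integral controller: adjust the cap by a fraction of the error."""
    cap = budget_watts                            # start at the budget itself
    for _ in range(steps):
        measured = read_node_power()
        error = budget_watts - measured           # positive if under budget
        cap = max(50.0, cap + gain * error)       # never go below a safe floor
        print(f"  measured {measured:6.1f} W, error {error:+6.1f} W")
        apply_power_cap(cap)
        time.sleep(period_s)

control_loop(budget_watts=100.0)
```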
Project: Towards Blob-Based Convergence Between HPC and Big Data
Blobs are arguably a solid storage model in a converging context between HPC and Big Data applications. Although there is a clear mapping between blobs and the storage systems used for batch data processing, streaming data is largely ignored by current state-of-the-art storage systems. Applications requiring the processing of a stream of events have specific requirements that can span in-network data processing, delivery guarantees, or latency. Unfortunately, these features are lacking in HPC environments. In this talk, we discuss the design of a large-scale logging framework based on converging blob storage that helps fill the gap between HPC and Big Data applications relying on streaming data processing.
Project: Extreme-Scale Workflow Tools - Swift, Decaf, Damaris, FlowVR
Workflow systems promise scientists an automated end-to-end path from hypothesis to discovery. Expecting any single workflow system to deliver such a wide range of capabilities is impractical, however. A more practical solution is to compose the end-to-end workflow from more than one system. With this goal in mind, the integration of distributed and in situ workflows is explored, where the result is a hierarchical heterogeneous workflow composed of subworkflows, with different levels of the hierarchy using different programming, execution, and data models. In this talk, we present the results of our investigation together with the lessons learned from our integration, which we hope will increase understanding and motivate further research into heterogeneous workflow composition.
Project: Evaluating high-level programming models for FPGA platforms
Compared to central processing units (CPUs) and graphics processing units (GPUs), which have fixed architectures, field-programmable gate arrays (FPGAs) offer reconfigurability and promising performance and energy efficiency. With these FPGA-based heterogeneous computing systems, programming standards have emerged to facilitate the transformation of algorithms from standard systems to heterogeneous systems. Open Computing Language (OpenCL) is a standard framework for writing programs that execute across various heterogeneous computing platforms. Investigating the characteristics of kernel applications with emerging OpenCL-to-FPGA development flows is important for researchers with little hardware development experience to evaluate and adopt the FPGA-based heterogeneous programming model in a laboratory. In this talk, I will summarize the evaluations and optimizations of OpenCL kernels on an Arria 10-based OpenCL FPGA platform. The kernels are derived from streaming, integer- and floating-point-intensive, and proxy applications. The experimental results show that FPGAs are promising heterogeneous computing components for energy-efficient high-performance computing.
Project: Evaluating high-level programming models for FPGA platforms
FPGAs have already demonstrated their acceleration potential for several important workloads, particularly workloads emerging in HPC. However, due to their hardware-centric nature, no standard FPGA abstraction layers, programming interfaces or runtime systems exist, preventing the HPC community from building their own FPGA-based clusters. We are tackling this issue by cultivating an FPGA-HPC community in order to discuss abstraction techniques and standardization, and to identify benchmarking methods and implementations, common programming models and environments, and so on. We have organized several successful workshops (some co-located with major conferences). We will talk about our community efforts, technical outcomes, and next steps.
Project: Simplified Sustained System performance benchmark
We have been developing a new benchmark metric, the Simplified Sustained System Performance (SSSP) benchmark. In this talk, I will introduce some new results on the Intel Skylake (SKL) and the Arm ThunderX2.
Project: Resource Management, Scheduling, and Fault-Tolerance for HPC Workflows
While the use of workflows for HPC is growing, MPI interoperability remains a challenge for workflow management systems. The MPI standard and/or its implementations provide a number of ways to build multiple-programs-multiple-data (MPMD) applications, but these have limited applicability to workflow-like applications. In this presentation, we will update the JLESC community on the status of the MPI Launch feature that is currently accessible as a prototype on clusters and Cray systems. We will focus on two current activities around this prototype, 1) coupling parallel applications for in situ data transfer workflows and 2) the effort for MPI standardization of these features.
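The MPI Launch prototype discussed here is not part of the standard, so the sketch below falls back on the closest standardized mechanism, MPI_Comm_spawn through mpi4py, to show the flavor of coupling a parent workflow with a dynamically launched parallel task and exchanging in situ data over the inter-communicator. The single-file layout and the exchanged payload are illustrative assumptions, not the prototype's interface.

```python
# coupling_demo.py -- run with: mpiexec -n 1 python coupling_demo.py
# One script acts as both the parent workflow and the spawned workers: the
# parent spawns two copies of this file, broadcasts a payload over the
# inter-communicator, and gathers partial results back (in situ style).
import sys
import numpy as np
from mpi4py import MPI

if len(sys.argv) > 1 and sys.argv[1] == "worker":
    # Worker side: receive the payload, analyze a slice, send the result back.
    parent = MPI.Comm.Get_parent()
    comm = MPI.COMM_WORLD
    data = parent.bcast(None, root=0)
    chunk = np.array_split(data, comm.Get_size())[comm.Get_rank()]
    parent.gather(float(chunk.sum()), root=0)
    parent.Disconnect()
else:
    # Parent side: dynamically launch a 2-process parallel "analysis" task.
    inter = MPI.COMM_SELF.Spawn(sys.executable,
                                args=[__file__, "worker"], maxprocs=2)
    data = np.arange(8, dtype="d")
    inter.bcast(data, root=MPI.ROOT)
    partial = inter.gather(None, root=MPI.ROOT)
    print("partial sums from workers:", partial)
    inter.Disconnect()
```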
Continue execution, or interrupt and launch another task? How to make difficult scheduling decisions.
Initiate collaborations on scheduling stochastic tasks and workflows.
Task parallelism is omnipresent these days; whether in data mining or machine learning, for matrix factorization or even molecular dynamics (MD).
Despite the success of task parallelism on CPUs, there is currently no performant way to exploit the task parallelism of synchronization-critical algorithms on GPUs.
Due to this shortcoming, we aim to develop a tasking approach for GPU architectures.
Our use case is a fast multipole method for MD simulations.
Since the problem size in MD is typically small, we have to target strong scaling.
Hence, the application tends to be latency- and synchronization-critical.
Therefore, offloading, the classical programming model for GPUs, is infeasible.
In this short talk we share our experience with the design and implementation of tasking as an alternative programming model for GPUs using CUDA.
We also reveal several pitfalls that occur when implementing a tasking framework for GPUs and hope for vivid discussions about eliminating them. There are a number of open questions regarding, among others, warp-synchronous deadlocks, weak memory consistency, and hierarchical multi-producer multi-consumer queues.
I will attempt to use one talk to discuss two issues:
1) The use of Python-based workflows at extreme scale. Based on work in Parsl (http://parsl-project.org/), we know we can run large numbers of tasks at large scale on HPC resources, as well as use the same programming model for multiple resources (HPC systems, clouds, etc.). Here, I am interested in finding ways to collaborate with other workflow system developers, either via common user-facing APIs or by using common resource-facing APIs and underlying libraries, as well as finding potential Parsl users and working with them both to find and fill gaps in Parsl and to improve their applications.
2) Software sustainability. While researchers at national labs may be recognized for their software, this is problematic in academia. But in order for software to be sustained over more than the life of one project, the developers and maintainers need recognition aligned with their institution's mechanisms (e.g. hiring, promotion, tenure, etc.). I have been developing the concept of citation of software (https://doi.org/10.7717/peerj-cs.86 & https://www.force11.org/group/software-citation-implementation-working-group) for this purpose, and would like to talk about it and understand what gaps it has in this extreme-scale environment, as well as collaborate with others interested in exploring it or other software sustainability mechanisms.
see above
Emerging HPC workflows whose execution times are stochastic and unpredictable pose new challenges and opportunities for scheduling. This work presents alternative algorithms to schedule stochastic jobs with the aim of optimizing system- and/or user-level metrics. By leveraging traces of some typical neuroscience applications that exhibit such behavior, we show that traditional HPC schedulers are not suitable for these jobs, and we demonstrate the effectiveness of the new scheduling algorithms, with significant improvements in achievable performance.
Many open questions remain for scheduling stochastic jobs in HPC, including determining the optimal parallel execution of these jobs, and how to leverage checkpointing/restart to cope with the unpredictable behavior of these jobs. The latter question also connects naturally to the resilience of (either deterministic or stochastic) HPC workloads in faulty execution environments.
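A tiny Monte Carlo experiment can illustrate why reservation strategies matter for such jobs: the sketch below compares requesting the mean walltime, a conservative high percentile, and an increasing sequence of reservations for lognormally distributed jobs, counting the node time consumed by killed-and-resubmitted attempts and over-reservation. The distribution, cost model, and strategies are illustrative assumptions, not the traces or algorithms from this work.

```python
import numpy as np

rng = np.random.default_rng(42)
walltimes = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)   # "true" job lengths

def cost(walltimes, reservations):
    """Node time consumed when trying the reservations in order until one fits.

    A job killed at reservation t costs t (wasted) and is resubmitted; the
    successful attempt is charged the full reservation (the unused tail idles).
    """
    total = np.zeros_like(walltimes)
    done = np.zeros_like(walltimes, dtype=bool)
    for t in reservations:
        total[~done] += t
        done |= walltimes <= t
    total[~done] += walltimes[~done]       # last resort: run to completion
    return total.mean()

mean_req = [np.mean(walltimes)]
p99_req = [np.quantile(walltimes, 0.99)]
increasing = list(np.quantile(walltimes, [0.5, 0.8, 0.95, 0.99]))

for name, res in [("mean", mean_req), ("99th pct", p99_req),
                  ("increasing", increasing)]:
    print(f"{name:>10}: avg node time {cost(walltimes, res):7.2f} "
          f"(avg true walltime {walltimes.mean():.2f})")
```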
We consider irregular applications that have irregular patterns in data blocking and distribution, with a form of sparsity (e.g., irregular Cartesian tiling of a 2D matrix with block sparsity). Scheduling computations on these data structures poses problems of load balancing and computational intensity that make overlapping communications with computations even harder. When targeting hybrid distributed large-scale platforms with many GPUs per node, the support of runtime systems to schedule this execution helps but is not sufficient, and the algorithm needs to be adapted to the problem.
What is the proper design of performance modeling and heuristics to help the scheduling? Can we achieve a purely scheduling-based approach (as opposed to a control flow embedded in the algorithm)?
A critical component of any high-level task-based runtime system is the interface it offers. It is critical for two major reasons: users primarily interact with the runtime through it, and the interface determines the constraints on the capabilities the runtime can offer. PaRSEC is an example of a high-level task-based runtime system that offers multiple interfaces to developers. One such interface is Dynamic Task Discovery, which allows PaRSEC to support applications with a dynamic nature by building the task graph at runtime. In this talk we will present the scalability issue this interface suffers from and a way to address it.
How to efficiently build a task graph with minimal information from the users?
PaRSEC is a distributed task-based runtime system that transparently manages the communication between nodes and across heterogeneous devices. By adopting asynchronous operations, we reduce runtime overhead and thus improve application performance. The future construct has proven to be an effective asynchrony provider in many programming languages and runtime systems. In this talk I will present our current design for the usage of futures, preliminary results on some benchmarks, and the ongoing PaRSEC optimization work using futures. We believe that our generic approach can be broadly adopted by other systems.
Future life-cycle management and its potential use cases.
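As a generic illustration of futures as an asynchrony provider, the snippet below chains dependent tasks through Python's concurrent.futures so that transfer-like work overlaps with computation; PaRSEC's future construct serves an analogous role inside the runtime, but this sketch is only an analogue, not PaRSEC code.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_block(i):
    time.sleep(0.1)                 # stands in for a data transfer
    return list(range(i, i + 4))

def compute(block):
    return sum(x * x for x in block)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submitting returns futures immediately; transfers run concurrently, and
    # each computation is submitted as soon as its input transfer completes,
    # overlapping with the transfers still in flight.
    transfers = [pool.submit(fetch_block, i) for i in range(8)]
    results = [pool.submit(compute, f.result()) for f in transfers]
    print("total:", sum(r.result() for r in results))
```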
In this work, we are interested in modeling HPC applications for in situ analytics. The objective is to design robust application models for in situ analytics. Those models would give the user a resource partitioning of the system as well as a schedule of application tasks. The goal is to improve application performance and reduce overall cost by providing efficient resource usage for an application running on a targeted platform.
Starting from a first model, we now want to evaluate this model on real-world application setups to determine 1) the efficiency of our resource partitioning versus "by-hand" resource partitioning and 2) the efficiency of some proposed scheduling policies. The results would provide us with new perspectives for future model enhancements.
OPEN QUESTIONS:
1) From a theoretical perspective, what is an HPC application, and how can we improve the platform representation (burst buffers, NVRAM usage, etc.)? This includes the needs in terms of in situ analytics (task modeling, data locality, dependencies between analyses, etc.) as well as the general platform model.
2) How to design effective scheduling policies for analysis tasks? Can we guarantee a performance bound?
3) What strategies should be deployed in order to evaluate such theoretical models on real-world applications?
4) In general, what are the community's expectations for such a work?
4) In general, what is the community expectation about such a work?
COLLABORATION OPPORTUNITIES:
A) We would like to test our models on real applications. Any collaboration for this purpose will be welcomed. We need experimental setup and applications to evaluate our models.
B) Any collaboration on the modeling part will be highly welcomed, including discussions on scheduling and storage issues.
Many problems remain to be solved in this work, and any collaboration or discussion on this work will be appreciated during the workshop.
Big Data applications are increasingly moving from batch-oriented execution models to stream-based models that enable them to extract valuable insights close to real time. To support this model, an essential part of the streaming processing pipeline is data ingestion, i.e., the collection of data from various sources (sensors, NoSQL stores, filesystems, etc.) and their delivery for processing. Data ingestion needs to support high throughput and low latency and must scale to a large number of both data producers and consumers. Since the overall performance of the whole stream processing pipeline is limited by that of the ingestion phase, it is critical to satisfy these performance goals. However, state-of-the-art data ingestion systems such as Apache Kafka build on static stream partitioning and offset-based record access, trading performance for design simplicity. In this talk we introduce KerA, a data ingestion framework that alleviates the limitations of the state of the art thanks to a dynamic partitioning scheme and to lightweight indexing, thereby improving throughput, latency and scalability. Experimental evaluations show that KerA outperforms Kafka by up to 4x for ingestion throughput and up to 5x for overall stream processing throughput. Furthermore, they show that KerA is capable of delivering data fast enough to saturate the big data engine acting as the consumer.
How can approaches for scalable stream processing such as KerA be combined with in situ/in transit data processing architectures to address the needs of application scenarios combining HPC and data analytics?
More on resilience techniques.
Collaborate on resilience and scheduling problems.
In this presentation we will introduce different optimal algorithms for scheduling adjoint chains and adjoint multi-chains on general multi-level memory architectures. An extension to more general graphs is needed.
An extension of the presented algorithms to general adjoint graphs can be a collaboration opportunity. In some contexts, the memory storing the checkpoints can fail, and the need arises to design resilient versions of the presented algorithms.
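To make the checkpointing trade-off concrete, here is a minimal sketch of reversing a chain of n steps with periodic checkpoints: only every c-th forward state is stored, and intermediate states are recomputed from the nearest checkpoint during the backward sweep. It illustrates the memory/recomputation trade-off behind the optimal schedules mentioned above, not the algorithms themselves; the toy step function is our own.

```python
# Reverse an n-step chain x_{i+1} = f(x_i) storing only every c-th state,
# recomputing the rest segment by segment during the adjoint sweep.
def f(x):
    return x + 0.5 * x * (1.0 - x)            # one (toy) forward step

def df(x):
    return 1.0 + 0.5 * (1.0 - 2.0 * x)        # its derivative

def adjoint_with_checkpoints(x0, n, c):
    # Forward sweep: keep checkpoints at steps 0, c, 2c, ...
    checkpoints = {}
    x = x0
    for i in range(n):
        if i % c == 0:
            checkpoints[i] = x
        x = f(x)
    # Backward sweep: for each step, recompute its input from the checkpoint.
    xbar, recomputations = 1.0, 0
    for i in reversed(range(n)):
        x = checkpoints[i - i % c]
        for _ in range(i % c):                # recompute up to state x_i
            x = f(x)
            recomputations += 1
        xbar *= df(x)                         # accumulate d x_n / d x_0
    return xbar, recomputations

grad, extra = adjoint_with_checkpoints(x0=0.1, n=32, c=4)
print(f"d x_n / d x_0 = {grad:.6f} with {extra} recomputed forward steps")
```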
Contrary to post-hoc MD data analytics that uses centralized data analysis (i.e., first generating and saving all the trajectory data to storage and then relying on post-simulation analysis), we extract MD data from the simulation as it is generated, analyze the data, and annotate MD outputs to drive the next steps in increasingly complex MD workflows. We extract data from the simulation by augmenting Plumed, a widely used open source MD analysis library that is designed to be a universal plugin for several popular MD codes. We integrate simulation and analytics on top of DataSpaces, an open source library that provides a shared-space abstraction for simulation and analytics applications to share data based on a tuple-space model. This strategy allows us to run structural analysis of MD frames agnostically, without modifications to the MD code, the data management middleware, or the analytics.
The modular design of our software framework allows us to mix and match different types of MD simulations, middleware, and analysis to build complex workflows. A near-optimal setting for simulation, middleware, and analytics can minimize data movement, execution time, and energy usage. However, such a setting needs to be determined. This talk aims to be a platform to brainstorm techniques and methodologies for the search for such a setting, independently of the type of molecular systems our MD simulations are studying.
1) Even though Plumed makes our in-situ analytics method broadly applicable to a variety of MD codes, the list of currently supported MD codes is not exhaustive. Are there other methods to extract data from MD simulations without recompiling the MD source code when data brokers such as Plumed are non-viable because of licensing or simply unavailable? In fact, we would kill two birds with one stone if an alternate approach worked without a data broker, making our methods even more broadly applicable to other simulation paradigms which share a similar workflow and differ only in the data schema.
2) Given a type of MD simulation, middleware, and analytics, is it possible to predict the near-optimal parameters using the performance data generated from this work?
3) Our work currently focuses on execution time and memory usage as performance metrics. Lack of fine grain control over energy measurement has limited our ability to use energy usage as a performance metric. How can we measure the energy usage associated with inter-node or intra-node data movement or in-situ analysis processes that may be operating on dedicated cores?
The Holistic Measurement Driven System Assessment (HMDSA) project is designed to enable maximum science production for large-scale high performance computing (HPC) facilities, independent of major component vendor, and within budget constraints of money, space, and power. We accomplish this through development and deployment of scalable, platform-independent, open-source tools and techniques for monitoring, coupled with runtime analysis and feedback, which enables highly efficient HPC system operation and usage and also informs future system improvements.
We take a holistic approach through
* Monitoring of all performance impacting information (e.g., atmospheric conditions, physical plant, HPC system components, application resource utilization) with flexible frameworks for variable fidelity data collection to minimize application and/or system overhead,
* Developing scalable storage, retrieval, and run-time analytics to provide identification of performance impacting behaviors,
* Developing feedback and problem (e.g., faults, resource depletion, contention) mitigation strategies and mechanisms targeting applications, system software, hardware, and users.
How to handle immense amounts of data, flexible storage and access methods, application of intelligent agents to help assess system and application performance without needing to explicitly profile and/or instrument applications.
HPC systems have seen tremendous improvements in computational power, advances unmatched by communication capabilities, leading to a substantial imbalance between computation and communication power. Thus, data movement has become a new bottleneck in some large-scale applications. With this in mind, it is essential for a communication engine such as MPI to perform as efficiently as possible. This poster focuses on the study of, and various improvements to, multithreaded communication and collective operations in Open MPI, and on how they perform on modern HPC systems.
What are the advantages of the new scheme, how difficult is it to adopt, and what kinds of applications will benefit from it?
I will present an implementation of GPU convolution that favors coalesced accesses. Convolutions are the core operation of deep learning applications based on convolutional neural networks. Current GPU architectures are typically used for training deep CNNs, but some state-of-the-art implementations are inefficient for some commonly used network configurations. I will discuss experiments that used our new implementation, which yielded notable performance improvements — including up to 2.29X speedups — in a wide range of common CNN configurations.
I will be asking for potential collaborators willing to try our implementations in their CNNs.
I will also be interested in hearing about CNN architectures experiencing poor performance, to devise new optimization opportunities.
The presentation will give an overview of data-sparse problems and sparse formats in the context of graphs, databases, and FEM methods, together with algorithm design and performance engineering for sparse operations on GPUs with the Ginkgo open source linear algebra library and its software sustainability aspects.
Optimal data-sparse methods and sparse-format unification across graphs, databases, and FEM? Productivity aspects in algorithm design for optimal performance when engineering sparse operations on GPUs? Unified GPU programming solutions between CUDA, OpenMP, and OpenACC? Is CUDA managed memory suitable for sparse solvers?
Classification schemes for Polar Stratospheric Clouds will be presented. Methods used for feature reduction include autoencoders and kernel PCA. A comparison with previous results will be provided to assess the prediction performance.
Numerical computation efficiency of the algorithm: what can be done?
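A minimal scikit-learn sketch of the kernel-PCA-plus-classifier pipeline described above, run on synthetic data with illustrative parameters; the actual Polar Stratospheric Cloud features, labels, and model choices are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the observations: 40 raw features, 3 cloud classes.
X, y = make_classification(n_samples=2000, n_features=40, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce the feature space with kernel PCA, then classify in the reduced space.
model = make_pipeline(KernelPCA(n_components=8, kernel="rbf"),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```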
High-end computing relies increasingly on machines with large numbers of GPUs. At the same time, GPU kernels are commonly produced in the process of automated software tuning. The Benchtesting OpeN Software Autotuning Infrastructure (BONSAI) project provides a software infrastructure for deploying large performance-tuning sweeps to supercomputers. BONSAI allows for parallel compilation and benchmarking of a large number of GPU kernels by dynamically scheduling work to a large number of distributed-memory nodes with multiple GPU accelerators. In this talk we outline BONSAI's design and highlight its main capabilities.
collaboration opportunities:
* GPU kernel development - use BONSAI for making large GPU kernel autotuning sweeps
* machine learning - use large datasets produced by BONSAI tuning sweeps
Within the Helmholtz Analytics Framework there is a need for the analysis of data from large to extreme sizes. The current tools for this are run on either high performance computing (HPC) systems or graphics processing unit (GPU) based systems. While there are tools intended for the use of both of these systems concurrently, their communication methods are not designed for HPC environments. The goal of the Helmholtz Analytics Toolkit (HeAT) is to fill this gap.
The HeAT framework is being developed to use both HPC systems and GPUs with MPI communication to analyze extremely large datasets. It is built on the concept of using multiple linked PyTorch tensor objects on multiple nodes to distribute the data as well as the computations. In the future, the HeAT framework will include multiple machine learning algorithms as well as multiple traditional data analysis tools.
What are some possibilities for efficient eigenvalue solvers for distributed datasets?
The data here is not split into blocks but rather split along one dimension; can a new parallel matrix multiplication method be devised for this that is less communication-intensive than the more traditional approaches?
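The core idea (a global tensor split along one dimension across MPI processes, with global operations expressed as local computation plus communication) can be sketched in a few lines. The example below uses NumPy and mpi4py for portability rather than HeAT's actual PyTorch-based implementation, and the variable names are our own.

```python
# Run with: mpiexec -n 4 python split_mean.py
# A global 2D array is split along dimension 0; each rank holds one slab and
# global reductions are expressed as local reductions plus an allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

global_rows, cols = 1000, 8
counts = [global_rows // size + (r < global_rows % size) for r in range(size)]
local = np.full((counts[rank], cols), fill_value=float(rank))  # this rank's slab

# Global column means: sum locally, allreduce, divide by the global row count.
local_sum = local.sum(axis=0)
global_sum = np.empty_like(local_sum)
comm.Allreduce(local_sum, global_sum, op=MPI.SUM)
global_mean = global_sum / global_rows

if rank == 0:
    print("global column means:", global_mean)
```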
DeepHyper is a Python package that comprises two components: 1) neural architecture search, an approach for automatically searching for high-performing deep neural network architectures, and 2) hyperparameter search, an approach for automatically searching for high-performing hyperparameters for a given deep neural network. DeepHyper provides an infrastructure that targets experimental research in neural architecture and hyperparameter search methods, scalability, and portability across HPC systems. It comprises three modules: benchmarks, a collection of extensible and diverse DL hyperparameter search problems; search, a set of search algorithms for DL hyperparameter search; and evaluators, a common interface for evaluating hyperparameter configurations on HPC platforms.
Machine learning applications
The talk will discuss the ongoing work at Jülich that aims at bridging HPC, CFD, and Rhinology to realize HPC-supported personalized medicine.
In-situ computational steering; Interactive supercomputing; Realistic real-time CFD simulations
We will present MagmaDNN, a high-performance data analytics library for manycore GPUs and CPUs. MagmaDNN is a collection of high-performance linear algebra routines for deep neural network computations and data analytics.
MagmaDNN is open source, so we need collaborators for the development. Is this just another new library for DNNs, or can we bring more in terms of HPC? Current frameworks are not that well optimized, while the core of the computations needed for AI, DNNs, and big-data analytics is linear algebra. As experts in the field, can we collect the routines needed in a framework and find applications to use it? Besides development and high performance, open research questions are how to design DNNs, tune hyperparameters, accelerate solvers, make sense of how they work, etc.
The objective of the Software for Linear Algebra Targeting Exascale (SLATE) project is to provide fundamental dense linear algebra capabilities to the US Department of Energy and to the high-performance computing (HPC) community at large, and ultimately to replace the venerable Scalable Linear Algebra PACKage (ScaLAPACK). SLATE is being developed from the ground up, with focus on scalability and support for hardware accelerators. This talk highlights SLATE's main design principles.
collaboration opportunities:
* science apps - implements building blocks for many science apps
* benchmarking - contains a GPU-accelerated implementation of the HPL benchmark
* scheduling research - many routines make good mini-apps for scheduling research
We will present some recently developed mixed-precision solvers that use FP16 arithmetic. Current hardware, e.g., GPUs with Tensor Cores, has started to support accelerated low-precision arithmetic for use in AI applications. This hardware is readily available in current extreme-scale systems (like Summit), and there is interest in using it in general solvers and applications beyond deep learning networks.
Are there other mixed-precision algorithms of interest that can be collected in a mixed-precision numerical library? What applications need and can benefit from mixed-precision computations? Collaboration is needed for the development of software and research questions related to accuracy and performance.
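A classic instance of the mixed-precision idea is iterative refinement: solve in low precision, then correct the solution with residuals computed in high precision. The sketch below demonstrates the pattern with float32 taking the role of the fast low precision (FP16 Tensor Cores in the actual solvers); it is an illustration of the approach, not the MAGMA implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
b = rng.standard_normal(n)

# Low-precision copy of the system matrix used for the inner solves
# (a stand-in for an FP16 LU factorization on Tensor Cores).
A32 = A.astype(np.float32)

x = np.zeros(n)
for it in range(5):
    r = b - A @ x                                   # residual in float64
    d = np.linalg.solve(A32, r.astype(np.float32))  # correction in low precision
    x += d.astype(np.float64)
    print(f"iter {it}: relative residual = "
          f"{np.linalg.norm(r) / np.linalg.norm(b):.2e}")
```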
The explosion of hardware parallelism inside a single node calls for a shift in programming paradigms and disruptively different algorithm designs that allow exploiting the compute power available in new hardware technology. We propose a parallel algorithm for computing a threshold incomplete LU (ILU) factorization. The main idea is to interleave an element-parallel fixed-point iteration that approximates an incomplete factorization for a given sparsity pattern with a procedure that adjusts the pattern to the problem characteristics. We describe and test a strategy for identifying nonzeros to be added and nonzeros to be removed from the sparsity pattern. The resulting pattern may be different and more effective than that of existing threshold ILU algorithms. Also, in contrast to other parallel threshold ILU algorithms, much of the new algorithm has fine-grained parallelism.
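For reference, a single element-parallel fixed-point sweep over the current sparsity pattern S (in the style of Chow and Patel; shown here only as an illustration) updates each stored entry independently, and the pattern-adjustment step described above is interleaved with such sweeps:

```latex
% One fixed-point sweep over the current pattern S (illustrative):
l_{ij} = \frac{1}{u_{jj}}\Bigl(a_{ij} - \sum_{k=1}^{j-1} l_{ik}\,u_{kj}\Bigr) \quad (i > j), \qquad
u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik}\,u_{kj} \quad (i \le j), \qquad (i,j) \in S .
```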
Optimal precision for FEM? Sparsity pattern exploitation and its limits? Optimal variant of ILU schemes?
Over the last few years, we have observed a growing mismatch between the arithmetic performance of processors in terms of the number of floating point operations per second (FLOPS) on the one side, and the memory performance in terms of how fast data can be brought into the computational elements (memory bandwidth) on the other side. As a result, more and more applications can utilize only a fraction of the available compute power as they are waiting for the required data. With memory operations being the primary energy consumer, data access is also pivotal to the resource balance and the battery life of mobile devices. In this talk we will introduce a disruptive paradigm change with respect to how scientific data is stored and processed in computing applications. The goal is to 1) radically decouple the data storage format from the processing format; 2) design a "modular precision ecosystem" that allows for more flexibility in terms of customized data access; 3) develop algorithms and applications that dynamically adapt data access accuracy to the numerical requirements.
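A minimal sketch of the idea of decoupling the storage format from the processing format, assuming a simple fp32-storage/fp64-arithmetic split (a real modular precision ecosystem would offer more formats and runtime adaptivity):

```cpp
#include <cstddef>
#include <vector>

// Data is stored in 32-bit precision (halving memory traffic), while all
// arithmetic is carried out in 64-bit registers after conversion on load.
double dot(const std::vector<float>& x, const std::vector<float>& y) {
  double acc = 0.0;                                  // processing format: fp64
  for (std::size_t i = 0; i < x.size(); ++i)
    acc += static_cast<double>(x[i]) * static_cast<double>(y[i]);
  return acc;                                        // storage format: fp32
}
```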
Optimal precision for iterative solvers? A priori and a posteriori precision selection.
Mesh partitioning is a potential bottleneck on extreme-scale systems. Moreover, static partitions evaluated a priori can become inefficient on large and heterogeneous systems with rather unpredictable performance, especially for complex simulations with heterogeneity in the discretization or in the physics evolution across the domain. We have been working on dynamic mesh partition optimization and have tested it on different heterogeneous systems such as Piz Daint from CSCS or the CTE P9 MareNostrum IV cluster from BSC. We aim to share our experience and seek collaboration on different issues regarding parallel partitioning, load balancing, and the implications of the partition type (graph-based / geometry-based) on the performance of the different phases of the simulation code.
We are interested in scalable mesh partitioning; comparison of different partitioners; definition of convenient partitions for heterogeneous systems; the trade-off between load balancing and communication reduction; and the impact of the partition on the linear solvers.
The contributions of this short talk are in the growing high performance computing (HPC) field of data analytics and are at the intersection of empirical collection of performance results and the rigorous, reproducible methodology for their collection. Our work in the field of characterizing power and performance in data-intensive applications using MapReduce over MPI expands traditional metrics such as execution times to include metrics such as power usage and energy usage associated with data analytics and data management. We move away from the traditional compute-intensive workflows towards data-intensive workloads with a focus on MapReduce programming models as they gain momentum in the HPC community. The talk focuses on the quantitative evaluation of performance and power usage over time in data-intensive applications that use MapReduce over MPI. We identify ideal conditions for execution of our mini-applications in terms of (1) dataset characteristics (e.g., unique words in datasets); (2) system characteristics (e.g., KNL and KNM); and (3) implementation of the MapReduce programming model (e.g., impact of various optimizations). Preliminary results presented in the talk illustrate the high power utilization and runtime costs of data management on HPC architectures.
Open questions we would like to discuss at the workshops are as follows:
(1) how far are our observations from a general principle relating power cap and performance in data-intensive applications?
(2) is there any way of reducing data movement (and power usage) other than the combiner techniques?
(3) how can we tune the settings of the underlying MapReduce framework during runtime to extend identified "sweet spot" regions (i.e., regions of minimum runtime and power usage)?
The EU H2020 Centre of Excellence POP (Performance Optimisation and Productivity), with both BSC and JSC as partners, has received funding to operate for an additional three years (Dec 2018 to Nov 2021). We will quickly highlight important new aspects of the CoE relevant to JLESC, namely the work on standard performance assessment metrics and an associated methodology, and the co-design data repository.
Contributions/collaboration on both the standard performance assessment metrics/methodology and the co-design data repository.
Modern CPUs offer a plethora of different native events for monitoring hardware behavior. Some map readily to concepts that are easily understood by performance analysts. Many others, however, involve esoteric micro-architectural details, making it difficult even for performance experts to fully understand how to take advantage of them, or which ones to use for measuring desired behaviors and exposing pathological cases. In this talk we will outline our work that aims to shed light on these obscure corners of performance analysis.
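As a starting point for the discussion, one common way to read such events is through the PAPI counter interface; the sketch below (an assumption for illustration, since the talk does not prescribe a specific tool, and the event names are placeholders whose availability depends on the architecture) measures two events around a region of interest:

```cpp
#include <papi.h>
#include <cstdio>

int main() {
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

  int evset = PAPI_NULL;
  PAPI_create_eventset(&evset);
  // Placeholder events; native events differ per micro-architecture.
  PAPI_add_named_event(evset, "PAPI_TOT_CYC");
  PAPI_add_named_event(evset, "PAPI_L2_TCM");

  long long counts[2];
  PAPI_start(evset);
  volatile double s = 0.0;                 // region of interest (dummy work)
  for (int i = 0; i < 1000000; ++i) s += 0.5 * i;
  PAPI_stop(evset, counts);

  std::printf("cycles=%lld  L2 misses=%lld\n", counts[0], counts[1]);
  return 0;
}
```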
Which part of the architecture do you think is the most important to monitor in order to assess the performance of your code, and which events would you use to do so?
In this talk, I will be discussing the capabilities, usage, and application of Software-based Performance Counters (SPCs) within Open MPI. These SPCs expose otherwise inaccessible internal Open MPI metrics through performance variables in the MPI Tools Information Interface. Enabling SPCs in Open MPI adds minimal overhead to MPI applications and can provide lower level information than the existing user-level PMPI interface. I will illustrate how these counters allow users to identify performance bottlenecks and areas for improvement in both user code and the MPI implementation itself.
This work lends itself to collaboration with performance tool developers and application developers looking to analyze the performance of the MPI portions of their code.
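For tool and application developers who want to see what is exposed, the MPI Tools Information Interface can be used to enumerate the available performance variables; in Open MPI, the SPCs appear in this list when enabled. A minimal sketch:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  int provided = 0;
  MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
  MPI_Init(&argc, &argv);

  int num_pvars = 0;
  MPI_T_pvar_get_num(&num_pvars);
  for (int i = 0; i < num_pvars; ++i) {
    char name[256], desc[256];
    int name_len = sizeof(name), desc_len = sizeof(desc);
    int verbosity, var_class, bind, readonly, continuous, atomic;
    MPI_Datatype dtype;
    MPI_T_enum enumtype;
    MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class, &dtype,
                        &enumtype, desc, &desc_len, &bind, &readonly,
                        &continuous, &atomic);
    std::printf("pvar %3d: %s\n", i, name);   // SPC counters show up in this list
  }

  MPI_Finalize();
  MPI_T_finalize();
  return 0;
}
```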
In this talk, I will present our design space exploration study that considers the most relevant architecture design trends we are observing today in HPC systems. I will discuss performance and power trade-offs when targeting different HPC workloads, and my take on the main issues for advancing academic research in this area.
The design space of HPC architectures is widening: accelerators, high number of cores and nodes, new memory technologies. How do we integrate all the necessary pieces to simulate such systems at reasonable speed and accuracy?
We are exploring next-generation high performance computing architectures for the Post-Moore Era. We are also trying to develop a methodology to estimate and analyze the performance of such future-generation architectures.
Development of a Custom Computing System with State-of-the-Art FPGAs
What kind of computing kernels and communication/synchronization should we offload to a tightly coupled FPGA cluster? We are looking for such killer computation and communication kernels for FPGAs!
Exascale and post-exascale architectures are expected to become more complex, exhibiting deeper memory hierarchies and more complex topologies.
Executing HPC applications on such platforms will require careful allocation of resources. In particular, advanced workloads like ensemble simulations or coupled codes might also want to use either co-scheduling or dynamic resource management for better efficiency.
To address this challenge, we are exploring the use of dynamic resource control schemes to map complex workloads onto a node, using performance monitoring to dynamically redistribute resources among components. For the resource partitioning and allocation scheme, we use a special kind of container called "resource slices".
These slices do not perform namespacing or virtualization, but only take care of resource control, using the available operating system interfaces on Linux (cgroups, resctrl). So far, we can perform CPU and NUMA node allocation, as well as cache bandwidth control.
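The sketch below is only an illustration of the kind of operations such a slice performs through the standard Linux cgroup-v2 interface files (it is not the actual resource-slice API); it assumes a mounted cgroup-v2 hierarchy with the cpuset controller enabled and sufficient privileges:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Create a "slice", pin it to a CPU range and a NUMA node, and move a process
// into it using the cgroup-v2 interface files.
void make_slice(const std::string& name, const std::string& cpus,
                const std::string& mems, long pid) {
  namespace fs = std::filesystem;
  const fs::path slice = fs::path("/sys/fs/cgroup") / name;
  fs::create_directory(slice);
  std::ofstream(slice / "cpuset.cpus") << cpus;   // e.g. "0-3"
  std::ofstream(slice / "cpuset.mems") << mems;   // e.g. "0" (NUMA node)
  std::ofstream(slice / "cgroup.procs") << pid;   // attach the process
}
```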
While we have some early users with ensemble workloads, we are looking for more collaborators willing to explore:
- other types of workloads,
- workload-specific resource allocation schemes,
- extensions of our resource slices to other resources.
This work explores the potential usage of hybrid Cloud+Edge+HPC architectures for data analytics. We focus on the case of Pl@ntNet, an application that uses ML to identify and classify plants based on their pictures. Users have an active role in classification, as they need to validate the result proposed by the classification algorithms, while new expert-validated recognitions serve to continuously improve the underlying training mechanism. We propose an approach based on decentralizing part of the computation necessary to classify plants from central processing (cloud) to the edge of the infrastructure. Namely, we investigate how local feature extraction and local fake-image detection could improve the general application performance. In addition, we tackle the problem of reducing the time taken to train the neural network used by the classifier, so that new plant species identified by the users could be added faster to the ML model using HPC capabilities.
There are two open questions:
1- What are the gains in terms of performance (classification latency) and data transmission costs obtained by performing part of the classification locally?
2- How resource-consuming is neural network continuous learning? Would it be possible to perform part of the processing on the Edge of the infrastructure?
Over the last decades, distributing high performance applications has been made easier thanks to outstanding advances in programming languages, runtime systems, parallel libraries, and load balancing algorithms. But even if it has been made easier, it still remains, for the most part, a nightmare. In this talk I will discuss how the problem can be attacked, learning from existing practices and leveraging new programming approaches. In particular, as MPI and OpenMP have proven that standardization of interfaces is a successful approach, I will discuss the possibility of standardizing load balancing abstractions. I will present the implications in terms of software architecture, and how such an effort could benefit the entire HPC community, from application writers to algorithm developers.
How to abstract load balancing from runtime systems and applications? How to express load balancing abstractions as types? How to leverage concept-based programming to design a load balancing library? Could C++ executors be used to provide load balancing capabilities?
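On the question of expressing load balancing abstractions as types, a hypothetical C++20 sketch is shown below; the names (WorkItem, LoadBalancer, GreedyBalancer) are illustrative and not part of any existing library:

```cpp
#include <concepts>
#include <cstddef>
#include <vector>

struct WorkItem { std::size_t id; double load; };

// A load balancer is anything that maps weighted work items onto nprocs
// processors and returns, for each item, the owning processor.
template <typename B>
concept LoadBalancer = requires(B b, const std::vector<WorkItem>& work, int nprocs) {
  { b.assign(work, nprocs) } -> std::same_as<std::vector<int>>;
};

// One concrete model of the concept: greedy list scheduling.
struct GreedyBalancer {
  std::vector<int> assign(const std::vector<WorkItem>& work, int nprocs) const {
    std::vector<double> load(nprocs, 0.0);
    std::vector<int> owner(work.size());
    for (std::size_t i = 0; i < work.size(); ++i) {
      int best = 0;
      for (int p = 1; p < nprocs; ++p) if (load[p] < load[best]) best = p;
      owner[i] = best;
      load[best] += work[i].load;
    }
    return owner;
  }
};
static_assert(LoadBalancer<GreedyBalancer>);
```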
Spatial soil moisture data are relevant to environmental sciences (e.g., ecological niche modeling, carbon monitoring systems, and other Earth system models) and precision agriculture (e.g., optimizing irrigation practices and other land management decisions). The primary source of soil moisture data over large areas is satellite-borne, radar-based remote sensing technology. Though produced with daily measurements, a major drawback of satellite soil moisture datasets is their coarse resolution, often too coarse for local needs (e.g., precision agriculture). To remedy this, we are leveraging machine-learning techniques coupled with other sources of environmental information (e.g., topography or weather) that are related to soil moisture and available at a finer spatial resolution.
Our collaboration between computer scientists at the University of Tennessee and soil scientists at the University of Delaware is developing SOMOSPIE—a modular SOil MOisture SPatial Inference Engine—for generating soil moisture information at finer resolution than available from satellite data. The engine consists of modular stages for processing spatial data, generating models with machine learning techniques, and analyzing the resulting predictions. The initial stage curates the available remotely sensed soil moisture measurements and ancillary environmental data for the desired temporal and geographic region. For the central stage, we are utilizing traditional methods such as k-nearest neighbors and random forest regression, as well as novel techniques such as HYbrid Piecewise POlynomials (HYPPO). Finally, networks of ground sensors provide us with "ground truth" by which to validate both our soil moisture predictions and the methods by which we produce the predictions.
(1) Are there other areas with a similar objective--downscaling one dataset using related data available at a higher resolution--for which existing research efforts and our efforts may be mutually informative?
(2) Aside from computing basic statistics (e.g., correlation) between soil moisture and related variables, what computational methods exist for identifying relationships that can improve model generation?
(3) There are inherent mathematical limitations to predicting soil moisture at a specific point using a summary value for a surrounding area (i.e., a large pixel from coarse, remotely sensed data). Might iterative bootstrapping methods reduce the influence of the coarseness of source data?
We will present challenges and some current results in the development of high-performance FFT library for large scale heterogeneous systems. Our goal is to provide a sustainable high-performance FFT library for Exascale platforms that leverages the large investments in FFT software by the broader HPC community.
Of particular interest are links to applications: what applications use FFTs, how, and can we use application-specific optimizations while still providing a library with a consistent API? Furthermore, FFT has particular computational motifs that can be used elsewhere. Of interest is high-performance MPI for all-to-all using GPU-direct communications; these are used in global "matrix transpositions". Are there other applications that need these building blocks, and is it of interest to expose them to users with a proper user interface?
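The global transpose mentioned above boils down to an all-to-all exchange; a minimal sketch of that building block (with an assumed chunk size and none of the packing or GPU-direct specifics) looks like this:

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int P = 0, rank = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &P);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int chunk = 4;                       // elements destined to each rank
  std::vector<double> send(chunk * P, rank), recv(chunk * P);

  // After this call, recv holds `chunk` elements from every rank: the
  // communication pattern behind a distributed matrix/FFT transpose.
  MPI_Alltoall(send.data(), chunk, MPI_DOUBLE,
               recv.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```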
Determining the presence of “galaxies” in an n-body simulation is usually based on some form of the Friends-of-Friends algorithm, which connects close groups of particles together into single components. This algorithm can be divided into three distinct phases: (a) Identifying the connected components, (b) Component labelling, and (c) Pruning insignificant components. Identifying connected components of a graph is a well researched problem, and can either be solved by using a graph traversal algorithm or by employing disjoint-set data structures. Distributed parallel versions of Breadth First (BFS) or Depth First Search (DFS) can be used to efficiently traverse a single component. However, the parallel BFS needs to be executed once per connected component, and these executions cannot be overlapped. We explore the design of a fully distributed asynchronous union-find algorithm where the edge set is processed in parallel across processes without any barriers across them.
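For reference, the sequential union-find primitive that the distributed asynchronous algorithm generalizes looks as follows (a sketch with illustrative names, using path compression and union by size):

```cpp
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
  std::vector<int> parent, size;
  explicit UnionFind(int n) : parent(n), size(n, 1) {
    std::iota(parent.begin(), parent.end(), 0);   // each vertex is its own root
  }
  int find(int x) {                               // with path compression
    while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
  }
  void unite(int a, int b) {                      // union by size
    a = find(a); b = find(b);
    if (a == b) return;
    if (size[a] < size[b]) std::swap(a, b);
    parent[b] = a;
    size[a] += size[b];
  }
};
// Friends-of-Friends then amounts to unite(u, v) for every particle pair
// closer than the linking length, followed by labeling and pruning.
```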
As the number of vertices we process for astrophysical simulations is easily a billion plus, the number of union-find operations, and in turn the number of messages in flight, can easily clog the network, hurting performance and in many cases running out of memory. Hence, it will be critical to explore an effective way to throttle the generation/identification of edges, e.g., via batching strategies. However, the design of such strategies is challenging because it faces the twin and opposite dangers of starving processors of work or swamping their memories, when the decision criteria are spread over all the processors. We develop effective strategies for this purpose in our work, and present results of this work in progress. The algorithm is general-purpose and has applications in domains such as social network analysis and other graph problems. Collaborations with application scientists as well as computer scientists are welcome.
Although single-node shared memory solutions have been explored, as soon as the problem gets large enough not to fit on a single node, it becomes very difficult to get good performance. Some of the challenges include very fine-grained messaging, tradeoff between shared memory and message-based programming costs, load imbalances, and prioritization and scheduling among multiple competing computations in an asynchronous setting.
Modular Supercomputing has recently been proposed as an alternative design approach to address open challenges of the current computing cluster paradigm such as power consumption, resiliency, and concurrency. One available machine puts this Modular Supercomputing design approach into practice using a Cluster-Booster setup. The Cluster nodes have fewer cores but higher frequencies and are used for the less scalable parts of an application, while the Booster nodes have a higher core count but lower frequencies and are intended for the more scalable parts of the application. Since Modular Supercomputing may be one of the possible paths towards Exascale-enabled HPC systems, it is important to understand which applications would typically benefit from a modular design. In this regard, there are several important open questions: 1) How should one partition an existing parallel application so that each part is executed across distinct modules? 2) Is it possible to predict which part of an application should run on which module? Further, is it possible to create a feature-based prediction for the partitioning? These features include the scalability, the memory footprint, and the amount of data that needs to be exchanged between different parts of the application. One key element common to all questions is to carry out software optimizations specifically aimed at modular execution that are still performance portable. One of the most important aspects of such optimizations is the performance of the communication between the different modules. Topology-aware MPI collectives are proposed to speed up the communication between modules by using topology information that is available at run time.
- General Guidelines for porting applications for Modular Supercomputing
- Feature-based partitioning prediction
- Optimizing MPI collectives using topology-awareness
As MPI comes to terms with shared memory in the presence of large multicore nodes, one approach for effective parallel programming without resorting to the multi-model programming represented by MPI+X approaches is to support multiple ranks within a single process. This is useful even in conjunction with MPI+X programming, as indicated by the partial support in the community for the “endpoints” proposal. In Adaptive MPI, developed in our research group at Illinois, virtualization and overdecomposition are necessary for the purpose of supporting key features such as dynamic load balancing, malleability and fault tolerance, in addition to adaptive overlapping of communication and computation. The problem then becomes how to support multiple ranks, each of which looks like a logical process to the programmer, in a single physical process, which allows sharing of memory in effective ways among ranks. Additional challenges arise when we wish to allow migration of these ranks across physical hosts. We will enumerate multiple issues and challenges that arise in this context and multiple approaches that are being explored. These include the old isomalloc approach for migration developed in France and in regular use in AMPI, the process-in-process approach currently being explored by Prof. Hori and collaborators, as well as a set of techniques being explored for Adaptive MPI. I will present these and seek ideas, synergies and collaborations among JLESC researchers.
Esoteric system-level issues arise in supporting multiple virtual ranks within a process. There are compiler-level issues that need unified approaches to privatization of a correct subset of global variables. OS support regarding mmap and virtual memory reservations comes up in trying to support migration. We are hoping that a comprehensive review of these challenges may generate solutions and suggestions from the broader JLESC community.
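A two-line sketch of the privatization issue mentioned above: a plain global is shared by all ranks co-located in one process, whereas thread-level privatization is only one possible mitigation (isomalloc and process-in-process being the alternatives discussed in the talk).

```cpp
int shared_counter;                // unsafe: visible to every rank in the process
thread_local int private_counter;  // privatized per rank only if each rank is a thread
```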
We evaluated the performance and the power consumption for multiple vector lengths on some benchmarks.
We will evaluate the effect of vector length on performance and energy consumption.
Tasks are a good support for composition. During the development of a high-level component model for HPC, we have experimented with managing parallelism from components using OpenMP tasks. Since version 4.0, the standard proposes a model with dependent tasks that seems very attractive because it enables the description of dependencies between tasks generated by different components without breaking maintainability constraints such as separation of concerns. The paper presents our feedback on using OpenMP in our context. We find that our main issues are a task granularity too coarse for our expected performance on classical OpenMP runtimes, and a harmful task-throttling heuristic that is counter-productive for our applications. We present a completion-time breakdown of task management in the Intel OpenMP runtime and propose extensions evaluated on a testbed application coming from the Gysela application in plasma physics.
OpenMP runtime limitations with respect to task granularity; higher-level data-flow models beyond plain task graphs.
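For readers unfamiliar with the dependent-task model referred to above, a minimal OpenMP example (unrelated to the Gysela code itself) expressing producer/consumer dependencies is:

```cpp
#include <cstdio>

int main() {
  int a = 0, b = 0;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task depend(out: a)
    { a = 1; }                              // producer of a
    #pragma omp task depend(out: b)
    { b = 2; }                              // independent producer of b
    #pragma omp task depend(in: a, b)
    { std::printf("%d\n", a + b); }         // runs only after both producers
  }
  return 0;
}
```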
With the objective of extracting as much performance as possible from exascale machines, the traditional mix of paradigms for shared memory, distributed memory and accelerators will struggle to achieve near-peak performance. Dataflow task-based runtime systems are a natural solution: they abstract architecture-specific APIs and remove excessive synchronizations, while proposing different domain-specific interfaces. Programming within such a model can prove to be a challenge, and can even have inherent scalability issues. We will explore the possibility of an easy-to-program middle-ground paradigm that delivers enough information to the runtime to mitigate scaling issues.
Exchange ideas with runtime and application developers on what information can be shared through the domain-specific interface to help the runtime improve its scheduling of tasks and communication.
Current communication libraries do not match the needs of asynchronous task models, such as PaRSEC or Legion, or of graph analytics frameworks; also, they cannot take good advantage of forthcoming smart NICs. We present our current design for LCI, a Lightweight Communication Library, and current work to use it to support PaRSEC.
Feedback on the current design will be useful. We have started a (currently unfunded) collaboration with George Bosilca at UTK and propose to make this collaboration a JLESC project.
From the current generation onward, compute nodes embed a complex addressable memory hierarchy. In order to extract performance from this memory, data locality has to be optimized through careful data allocation and well-timed data migrations. Though it is not desirable to expose such complexity to application developers, building blocks for operating the memory hierarchy will need to be exposed through a consistent and convenient interface to enable portable and efficient runtime optimizations. The AML memory library is being developed as part of the Argo project, funded by the Exascale Computing Project, to define and implement such building blocks. So far, the library envisions explicit memory management through three main components:
cross memory/device data migration,
explicit userdata layout and userdata access patterns,
and hardware locality.
While the library is under development and being integrated into real-world applications, I am looking for collaborations to design runtime optimizations enabled by these blocks, such as automatic prefetching or data packing.
Hopefully such building blocks will enable more extensive works on existing open questions such as:
** Optimal static allocation under capacity, bandwidth and latency constraints.
** Automatic data migration (to fast scratchpads or closer memory).
** Coupled management of threads and data ...
What levers do we have to encourage collaborations, apart from a convincing speech?
SIMD operations on irregular data structures may be difficult if the computational operations differ between the data elements. But often the compute kernel is repeatedly called for many different inputs and could easily take advantage of SIMD operations if the input is provided in AoSoA format.
We will show how a list of arrays can be 'transposed' (either on the fly or as a copy operation) into AoSoA format (and vice versa) using SIMD shuffle operations within C++. This approach may be helpful to accelerate some kernels if (permanently) changing the underlying memory layout is not possible or beneficial.
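As one concrete instance of such a shuffle-based transpose (a sketch assuming SSE; the same idea applies to wider SIMD ISAs), four AoS elements can be converted to SoA lanes entirely in registers:

```cpp
#include <xmmintrin.h>   // SSE intrinsics

struct Particle { float x, y, z, w; };

// Transpose four AoS elements into four SoA registers (x-lane, y-lane, ...)
// without changing the underlying memory layout.
inline void aos_to_soa4(const Particle* p,
                        __m128& xs, __m128& ys, __m128& zs, __m128& ws) {
  __m128 r0 = _mm_loadu_ps(&p[0].x);   // x0 y0 z0 w0
  __m128 r1 = _mm_loadu_ps(&p[1].x);   // x1 y1 z1 w1
  __m128 r2 = _mm_loadu_ps(&p[2].x);   // x2 y2 z2 w2
  __m128 r3 = _mm_loadu_ps(&p[3].x);   // x3 y3 z3 w3
  _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // shuffle-based 4x4 transpose
  xs = r0; ys = r1; zs = r2; ws = r3;  // x0..x3, y0..y3, z0..z3, w0..w3
}
```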
Which groups are facing similar problems?
Could they benefit from having this available as a separate library?
Which other SIMD architectures are of interest?
In this talk we present the Task-Aware MPI (TAMPI) library that extends the functionality of standard MPI libraries by providing new mechanisms for improving the interoperability between parallel task-based programming models, such as OpenMP or OmpSs-2, and both blocking and non-blocking MPI operations. When relying on the standard MPI interface alone, programmers must pay close attention to avoid deadlocks that may occur in hybrid applications (e.g., MPI+OpenMP) where MPI calls take place inside tasks. This is caused by the out-of-order execution of tasks, which alters the execution order of the enclosed MPI calls. The TAMPI library ensures a deadlock-free execution of such hybrid applications by implementing a cooperation mechanism between the MPI library and the parallel task-based runtime system.
TAMPI supports two different modes. The blocking mode targets the efficient and safe execution of blocking MPI operations (e.g., MPI_Recv) from inside tasks, while the non-blocking mode focuses on the efficient execution of non-blocking or immediate MPI operations (e.g., MPI_Irecv), also from inside tasks.
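To make the targeted pattern concrete, the library-agnostic sketch below issues blocking point-to-point operations from inside OpenMP tasks; this is exactly the kind of code that can misbehave with a plain MPI library once tasks execute out of order, and that TAMPI is designed to make safe (the sketch does not use the TAMPI API itself):

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int N = 8;
  std::vector<double> buf(N, rank);
  #pragma omp parallel
  #pragma omp single
  for (int i = 0; i < N; ++i) {
    #pragma omp task firstprivate(i) shared(buf)
    {
      // Blocking calls inside tasks: the task execution order, not the
      // program order, decides the order of the enclosed MPI calls.
      if (rank == 0)
        MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, /*tag=*/i, MPI_COMM_WORLD);
      else if (rank == 1)
        MPI_Recv(&buf[i], 1, MPI_DOUBLE, 0, /*tag=*/i, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
  }
  MPI_Finalize();
  return 0;
}
```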
We are looking for pure or hybrid MPI apps or mini-apps to test our TAMPI library and get feedback to improve it.
Online black-box optimization of the power-performance tradeoff via hardware actuators/sensors and lightweight application instrumentation.
What are the possible gains along the power-performance curve on production applications? Which feedback is useful, and what actuators have the most impact? If sufficient gains are possible, can they be achieved by an online control policy? We are currently working with Intel architectures, which other platforms could such work apply to?
This talk briefly introduces several use cases where checkpointing techniques can help capture critical application data structures for later re-use. It discusses both the specific requirements and constraints that checkpointing has in these use cases, as well as the potential benefits.
What are the key requirements for checkpointing when it is used beyond resilience? How do checkpointing techniques designed for resilience need to change to meet these requirements and optimize performance and scalability? If your group needs checkpointing beyond resilience, the speaker would be happy to learn more about the specific use case and collaboration opportunities.
As supercomputers have grown in scale to meet computing demands, their mean time between failures has declined. MPI is evolving to include concepts that enable continued operation despite failures. A recent push toward non-blocking, configurable, and asynchronous recovery tries to address issues with composing modular recovery procedures and to amortize the MPI repair cost with other recovery aspects. The goal is to enable application components to control the scope and timing at which errors are reported and to permit an asynchronous MPI recovery operation where multiple components' recovery procedures can overlap. Recent advances have demonstrated the feasibility of the approach and open a new landscape for resilient algorithms and application demonstrators that can operate in an asynchronous manner.
Opportunities for developing non-blocking recovery algorithms and global scope error reporting (implementation level) and application use cases.
As HPC systems grow larger and include more hardware components of different types, the system's failure rate becomes higher. Efficient fault tolerance techniques are essential not only to ensure the execution completion but also to save energy. In many cases, failures have a localized scope and their impact is restricted to a subset of the resources being used. In MPI applications, combining Checkpoint/Restart and message logging enables the localized rollback recovery of only the processes affected by a failure, which heavily reduces the recovery overhead. Using MPI remote memory access operations and performing a custom replay of collective operations lowers the synchronicity of the replay and can contribute towards minimizing the overall failure overhead and energy consumption.
The open questions will be focused on how to enable a receiver-driven replay of communications in MPI applications.
Deep neural networks (DNNs) have been quickly and broadly exploited to improve the data analysis quality (such as classification accuracy) in many complex science and engineering applications. Today's DNNs are becoming deeper and wider because of increasing demands on analysis quality and increasingly complex applications to resolve. Wide and deep DNNs, however, require large amounts of resources (such as memory, storage, and I/O), significantly restricting their utilization on resource-constrained systems. We propose DeepSZ: an accuracy-loss bounded neural network compression framework, which involves four key steps: network pruning, error bound assessment, optimization of the error bound configuration, and compressed model generation, featuring a high compression ratio and low encoding time. Experiments show that DeepSZ can compress AlexNet and VGG-16 on the ImageNet dataset by compression ratios of 46× and 116×, respectively, and compress LeNet-300-100 and LeNet-5 on the MNIST dataset by compression ratios of 57× and 56×, respectively, with only up to 0.3% loss of inference accuracy.
How to leverage lossy compression in different big-data related challenging research issues, such as checkpointing, communication, and storage performance? How to improve lossy compression quality for specific applications?
An in-depth understanding of the failure features of HPC jobs in a supercomputer is critical to the large-scale system maintenance and improvement of the service quality for users. In this paper, we investigate the features of hundreds of thousands of jobs in one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, based on 2001 days of observations with a total of over 32.44 billion core-hours. We study the impact of the system's events on the jobs' execution in order to understand the system's reliability from the perspective of jobs and users. The characterization involves a joint analysis based on multiple data sources, including the reliability, availability, and serviceability (RAS) log; job scheduling log; the log regarding each job's physical execution tasks; and the I/O behavior log. We present 22 valuable takeaways based on our in-depth analysis. For instance, 99,245 job failures are reported in the job-scheduling log, a large majority (99.4%) of which are due to user behavior. The best-fitting distributions of a failed job's execution length (or interruption interval) include Weibull, Pareto, inverse Gaussian, and Erlang/exponential, depending on the types of errors (i.e., exit codes). The RAS events affecting job executions exhibit a high correlation with users and core-hours and have a strong locality feature. In terms of the failed jobs, our similarity-based event-filtering analysis indicates that the mean time to interruption is about 3.5 days.
In Checkpoint/Restart (C/R), finding the optimal checkpoint interval is important for reducing I/O workloads while maximizing the resiliency of application executions. Typically, we find the optimal interval using stochastic models. With the emergence of more complicated checkpointing strategies in HPC (multi-level checkpointing, complex erasure encoding, etc.), modeling these approaches is becoming very difficult. Another approach is to rely on simulation techniques to find the optimal checkpoint interval. However, simulation is unacceptably time-consuming for practical use, where application developers would like to know the optimal interval at job submission. In this short talk, we introduce a checkpoint interval optimization technique based on AI.
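For a single-level scheme, the stochastic models mentioned above typically reduce to the classical Young/Daly first-order approximation, where C is the checkpoint cost and μ the platform MTBF:

```latex
% Young/Daly first-order optimal checkpoint interval:
\tau_{\mathrm{opt}} \approx \sqrt{2\,C\,\mu}
```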
What checkpoint/restart approaches are common in practice?
What resiliency strategies should be considered: multi-level C/R, asynchronous C/R, failure prediction, erasure encoding, or others?
Collaboration on C/R simulation framework development.
Title | Topic | Presenter | Authors | Abstract |
---|---|---|---|---|
Performance and Power Consumption Analysis of Arm Scalable Vector Extension with gem5 Simulator | Performance Tools | Tetsuya Odajima | Tetsuya Odajima, Yuetsu Kodama, Miwako Tsuji, Mitsuhisa Sato (RIKEN) | |
SLATE: Software for Linear Algebra Targeting Exascale | Numerical Methods | Jakub Kurzak | Jakub Kurzak, University of Tennessee Mark Gates, University of Tennessee Ali Charara, University of Tennessee Asim YarKhan, University of Tennessee Jack Dongarra, University of Tennessee | |
BONSAI: Benchmarking OpeN Software Autotuning Infrastructure | Performance Tools | Jakub Kurzak | Jakub Kurzak, University of Tennessee Mike Tsai, University of Tennessee Mark Gates, University of Tennessee Jack Dongarra, University of Tennessee | |
PaRSEC - a data-flow task-based runtime | Programming Languages and Runtimes | Yu Pei, Qinglei Cao | Reazul Hoque, Yu Pei, Qinglei Cao - University of Tennessee, Knoxville | |
Extending Open MPI with Tool and Resilience Support | Parallel Programming models and runtime; Performance tools; Resilience | David Eberius, Dong Zhong | David Eberius, Dong Zhong (Innovative Computing Laboratory, University of Tennessee) | |
PRIONN: Predicting Runtime and IO using Neural Networks | I/O, Storage and In-Situ Processing | Michael Wyatt | Michael Wyatt (UTK), Stephen Herbein (LLNL), Todd Gamblin (LLNL), Adam Moody (LLNL), Dong H Ahn (LLNL), Michela Taufer (UTK) | |
Modeling Record-and-Replay for Nondeterministic Applications on Exascale Systems | Resilience, Performance tools, Applications and mini-apps | Dylan Chapp | Dylan Chapp (University of Delaware, University of Tennessee, Knoxville), Danny Rorabaugh (University of Tennessee, Knoxville), Michela Taufer (University of Tennesee, Knoxville) | |
Performance improvements in Open MPI | Programming Languages and Runtimes | Xi Luo, Thananon Patinyasakdikul | Xi Luo, Thananon Patinyasakdikul (UTK) | |
Pseudo-Assembly Programming for Batched Matrix Factorization | scientific simulation, high-order PDE methods | Mike Tsai (UTK) | Mike Tsai (UTK) Piotr Luszczek (UTK) Jakub Kurzak (UTK) Jack Dongarra (UTK) | |
Improve Exascale IO via an Adaptive Lossy Compressor | Big Data, I/O and in-situ visualization | Xin Liang | Xin Liang (UCR), Sheng Di (ANL), Sihuan Li (UCR), Dingwen Tao (UA), Bogdan Nicolae (ANL), Zizhong Chen (UCR), Franck Cappello (ANL) | |
Towards SDC Resilient Error Bounded Lossy Compressor | Resilience, ABFT (Algorithm Based Fault Tolerance), Lossy compression | Sihuan Li | Sihuan Li, UC Riverside; Sheng Di, ANL; Xin Liang, UC Riverside; Zizhong Chen, UC Riverside; Franck Cappello, ANL | |
Towards Unified Tasking on CPUs and GPUs | Parallel Programming Models | Laura Morgenstern | Laura Morgenstern (JSC), Ivo Kabadshow (JSC) | |
Holistic Measurement Driven System Assessment | Architecture, Resilience | William Kramer | PI: Bill Kramer (NCSA/UIUC) NCSA: Greg Bauer Brett Bode Jeremy Enos Aaron Saxton Mike Showerman UIUC: Saurabh Jha (CS/CSL) Ravi Iyer (ECE/CSL) Zbigniew Kalbarczyk (ECE/CSL) SNL: Jim Brandt Ann Gentile | |
Modeling HPC applications for in situ Analytics | I/O, Storage and In-Situ Processing | Valentin HONORE | Guillaume AUPY, Brice GOGLIN, Valentin HONORE (Inria, LaBRI, Univ. Bordeaux, 33400 Talence, France) - Bruno RAFFIN (Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, 38000 Grenoble, France) | |
Enhancing Deep Learning towards Exascale with the DEEP-EST Modular Supercomputer Architecture | Parallel Programming models and runtime; Performance tools; | Sedona Rocco | Ernir Erlingsson, Gabriele Cavallaro, Morris Riedel, Rocco Sedona and Helmut Neukirchen (JSC) |
We examine the effect of vector length and the number of out-of-order resources on the performance and the runtime power consumption of benchmarks by using the gem5 processor simulator and the McPAT framework. We evaluate the performance for multiple vector lengths using the gem5 processor simulator, which supports cycle-accurate simulation of an out-of-order pipeline. A long vector length improves peak computing performance because of an increase in the number of elements that can be executed in parallel, but it requires a wide die area. We also examine the power consumption for multiple vector lengths using the McPAT framework, which is a simulator for processor area and power consumption.
The objective of the Software for Linear Algebra Targeting Exascale (SLATE) project is to provide fundamental dense linear algebra capabilities to the US Department of Energy and to the high-performance computing (HPC) community at large, and ultimately to replace the venerable Scalable Linear Algebra PACKage (ScaLAPACK). SLATE is being developed from the ground up, with focus on scalability and support for hardware accelerators. This poster highlights SLATE's main design principles.
High-end computing relies increasingly on machines with large numbers of GPUs. At the same time, GPU kernels are commonly produced in the process of automated software tuning. The Benchtesting OpeN Software Autotuning Infrastructure (BONSAI) project provides a software infrastructure for deploying large performance tuning sweeps to supercomputers. BONSAI allows for parallel compilation and benchmarking of a large number of GPU kernels by dynamically scheduling work to a large number of distributed-memory nodes with multiple GPU accelerators. In this poster we outline BONSAI's design and highlight its main capabilities.
PaRSEC is a data-flow, architecture-aware, task-based runtime system supporting applications running on heterogeneous distributed machines, which has been shown to provide excellent performance to many scientific applications. A modular design allows PaRSEC to offer several programming interfaces and provide efficient support for domain-specific languages. In this poster we demonstrate one such high-level interface – Dynamic Task Discovery, a more explicit low-level interface – Futures, and present recent work on task-based data redistribution, a 2D stencil, low-rank Cholesky, and their performance results.
This poster details the implementation and usage of two extensions to Open MPI: Software-based Performance Counters (SPCs) and a resilient PMIx Reference RunTime Environment (PRRTE). SPCs expose, through the MPI_T interface, internal Open MPI performance information that would otherwise be inaccessible. Resilient PRRTE provides an efficient runtime-level failure detection and propagation strategy targeting exascale systems. These extensions provide a more robust environment for Open MPI, with additional tool support through the MPI_T interface and fault tolerance through resilient PRRTE, which supports ULFM.
For job allocation decisions, current batch schedulers have access to and use only information on the number of nodes and the runtime, because it is readily available at submission time from user job scripts. User-provided runtimes are typically inaccurate because users overestimate or lack understanding of job resource requirements. Beyond the number of nodes and runtime, other system resources, including I/O and network, are not available but play a key role in system performance. There is a need for automatic, general, and scalable tools that provide accurate resource usage information to schedulers so that, by becoming resource-aware, they can better manage system resources.
We tackle this need by presenting a tool for Predicting Runtime and IO using Neural Networks (PRIONN). PRIONN automates prediction of per-job runtime and IO resource usage, enabling IO-aware scheduling on HPC systems. The novelty of our tool is the input of whole job scripts into deep learning models, which allows complete automation of runtime and IO resource predictions. We demonstrate the power of PRIONN with runtime and IO resource predictions applied to IO-aware scheduling for real HPC data. Specifically, we achieve over 75% mean and 98% median accuracy for runtime and IO predictions across 300,000 jobs from a real HPC machine. We combine our per-job runtime and IO predictions with queue and system simulations to predict future system IO usage accurately. We predict over 50% of IO bursts in advance on a real HPC system.
Record-and-replay (R&R) techniques present an attractive method for mitigating the harmful aspects of nondeterminism in HPC applications (e.g., numerical irreproducibility and hampered debugging), but are hamstrung by two problems. First, there is insufficient understanding of how existing R&R techniques' recording cost responds to changes in application communication patterns, inputs, and other aspects of configuration, and to the degree of concurrency. Second, current R&R techniques have insufficient ability to exploit regularities in the communication patterns of individual applications.
We contend that it is crucial that the HPC community is equipped with modeling and simulation methodologies to assess the response of R&R tools, both in terms of execution time overhead and memory overhead, to changes in the configuration of the applications they monitor. We introduce a graph-based modeling methodology for quantifying the degree of nondeterminism in fixed applications across multiple runs and for comparing nondeterminism across multiple applications based on the varying graph structure of the modeled executions.
These models will enable scientists and application developers to make informed decisions about the scenarios in which R&R tools can be deployed cost-effectively, thus increasing their usability and utility in HPC as a whole. Moreover, this modeling effort will provide insights into the recording costs associated with the various communication patterns endemic to HPC applications, which in turn will enable the development of R&R tools that exploit these patterns to realize more compact representations of executions and thus reduced recording overhead.
HPC systems have seen tremendous improvements in computational power, advances unmatched by communication capabilities, leading to a substantial imbalance between computation and communication power. Thus, data movement has become a new bottleneck in some large-scale applications. With this in mind, it is essential for communication engines such as MPI to perform as efficiently as possible. This poster focuses on the study of and various improvements to multithreaded communication and collective operations in Open MPI, and how they perform on modern HPC systems.
While CUDA and OpenCL provide software infrastructure for hardware accelerator programming, the optimizations from their compilers are not always capable of giving the programmer the necessary level of control over the generated code. Sometimes this may lead to degraded performance. However, there are lower-level software layers, including intermediate representations (IR, AMD IL/HSAIL) or pseudo-assembly (LLVM, PTX), which could potentially give the programmer much better management of register allocation and memory affinity without disrupting optimal control flow. In this poster, we show the results of batched matrix factorization written in NVIDIA PTX for the Tesla line of GPU cards. The experimental data demonstrate the performance advantages of low-level programming in comparison with CUDA kernels.
Because of the ever-increasing data being produced by today’s large-scale scientific simulations, I/O performance is becoming a significant bottleneck for their executions. In this paper, we explore how to significantly improve the I/O performance for large-scale scientific simulations by leveraging an optimized lossy compressor. Our contribution is threefold. (1) We propose a compression-based I/O performance model and investigate the relationship between parallel I/O performance and lossy compression quality, by processing large-scale scientific simulation data on a supercomputer with various state-of-the-art lossy compressors. (2) We propose an adaptive, prediction-based lossy compression framework that can select the best-fit strategy from among our optimized prediction approaches in terms of the datasets, in that different prediction methods are particularly effective in different cases. (3) We evaluate the parallel I/O performance by using three large-scale simulations across different domains with up to 8,192 cores. Experiments show that our adaptive compressor can improve the I/O performance by about 100% compared with the second-best lossy compressor because of significantly reduced data size. In absolute terms, our compressor can improve the compression ratio by 112∼165% compared with the second-best lossy compressor. The total I/O time is reduced by up to 60X with our compressor compared with the original I/O time.
Lossy compression has been demonstrated to be very effective for extreme-scale scientific simulations in terms of improving I/O performance, reducing disk space usage, and so on. It has been applied to many scientific simulations in various science fields including molecular chemistry, fluid dynamics, climate, cosmology, and so on. Such an important application will silently impact the correctness of scientific simulation results if it is hit by soft errors. Existing fault tolerance techniques usually use redundancy at the cost of duplicated hardware or doubled execution time. In this work, we exploit the possibility of achieving resiliency against SDC for one of the most widely used lossy compressors, SZ, from the perspective of software or algorithms. Preliminary results have shown that our proposed solution can increase the SDC resilience of SZ significantly with negligible overheads at large scale.
Task parallelism is omnipresent these days. Despite the success of task parallelism on CPUs, there is currently no performant way to exploit the task parallelism of synchronization-critical algorithms on GPUs. Due to this shortcoming, we develop a tasking approach for GPU architectures. Our use case is a fast multipole method for molecular dynamics (MD) simulations. Since the problem size in MD is typically small, we have to target strong scaling. Hence, the application tends to be latency- and synchronization-critical. Therefore, offloading as the classical programming model for GPUs is unfeasible. The poster highlights our experience with the design and implementation of tasking as alternative programming model for GPUs using CUDA. We describe the tasking approach for GPUs based on the design of our tasking approach for CPUs. Following this, we reveal several pitfalls implementing it. Among others, we consider warp-synchronous deadlocks, weak memory consistency and hierarchical multi-producer multi-consumer queues. Finally, we provide first performance results of a prototypic implementation.
The Holistic Measurement Driven System Assessment (HMDSA) project is designed to enable maximum science production for large-scale high performance computing (HPC) facilities, independent of major component vendor, and within budget constraints of money, space, and power. We accomplish this through development and deployment of scalable, platform-independent, open-source tools and techniques for monitoring, coupled with runtime analysis and feedback, which enables highly efficient HPC system operation and usage and also informs future system improvements.
We take a holistic approach through:
- Monitoring of all performance impacting information (e.g., atmospheric conditions, physical plant, HPC system components, application resource utilization) with flexible frameworks for variable fidelity data collection to minimize application and/or system overhead,
- Developing scalable storage, retrieval, and run-time analytics to provide identification of performance impacting behaviors,
- Developing feedback and problem (e.g., faults, resource depletion, contention) mitigation strategies and mechanisms targeting applications, system software, hardware, and users.
With the goal of performing exascale computing, the importance of I/O management becomes more and more critical to maintaining system performance. While the computing capacities of machines are getting higher, the I/O capabilities of systems do not increase as fast. We are able to generate more data but unable to manage it efficiently due to the variability of I/O performance. Limiting the requests to the Parallel File System (PFS) becomes necessary. To address this issue, new strategies are being developed, such as online in situ analysis. The idea is to overcome the limitations of basic post-mortem data analysis, where the data have to be stored on the PFS first and processed later. There are several software solutions that allow users to specifically dedicate nodes for data analysis and to distribute the computation tasks over different sets of nodes. Thus far, they rely on manual resource partitioning and allocation of tasks (simulations, analysis) by the user.
In this work, we propose a memory-constrained model for in situ analysis. We use this model to provide different scheduling policies that determine the number of resources that should be dedicated to analysis functions and that schedule these functions efficiently. We evaluate them and show the importance of considering memory constraints in the model.
The Dynamical Exascale Entry Platform – Extreme Scale Technologies (DEEP-EST) project aims at delivering a pre-Exascale platform based on a Modular Supercomputer Architecture (MSA), which provides, alongside a standard CPU cluster module, a many-core Extreme Scale Booster (ESB), a Global Collective Engine (GCE) to speed up MPI collective operations in hardware, Network Attached Memory (NAM) as a fast scratch file replacement, and a hardware-accelerated Data Analytics Module (DAM); the latter are able to perform near-data processing.
Title | Presenter | |
---|---|---|
Not Your Grandfather’s Tractor – How AI & IoT Are Transforming Production Agriculture | Mark Moran John Deere | |
Abstract: After some background on the sweeping changes happening in agricultural technology as a result of General Purpose Technologies, Mark will discuss how AI and IoT are merging on the farm, creating some compelling problems to be solved on the edge. | ||
Programming workflows for Advanced Cyberinfrastructure Platforms | Rosa M. Badia Barcelona Supercomputing Center | |
Abstract: In the design of an advanced cyberinfrastructure platform (ACP) that involves sensors, edge devices, instruments, computing power in the cloud and HPC, a key aspect is how to describe the applications to be executed on such a platform. Very often these applications are not standalone, but involve a set of sub-applications or steps composing a workflow. The scientists then rely on effective environments to describe their workflows and engines to manage them in complex infrastructures. COMPSs is a task-based programming model that enables the development of workflows that can be executed in parallel on distributed computing platforms. The workflows that we currently support may involve different types of tasks, such as parallel simulations (MPI) or analytics (i.e., written in Python thanks to PyCOMPSs, the Python binding for COMPSs). COMPSs, through a storage interface, makes transparent the access to persistent data stored in key-value databases (Hecuba) or object-oriented distributed storage environments (dataClay). While COMPSs has been developed since its early times for highly distributed environments, we have been extending it to deal with more challenging environments, with edge devices and components in the fog that can appear and disappear. | |
Delivering on the Exascale Computing Project Mission for the U.S. Department of Energy | Douglas B. (Doug) Kothe Oak ridge National Laboratory | |
Abstract: The vision of the U.S. Department of Energy (DOE) Exascale Computing Project (ECP), initiated in 2016 as a formal DOE project executing through 2023, is to accelerate innovation with exascale simulation and data science solutions that enhance U.S. economic competitiveness, strengthen our national security, and change our quality of life. ECP’s mission is to deliver exascale-ready applications and solutions that address currently intractable problems of strategic importance and national interest; create and deploy an expanded and vertically integrated software stack on DOE HPC exascale and pre-exascale systems, thereby defining the enduring US exascale ecosystem; and leverage U.S. HPC vendor research activities and products into DOE HPC exascale systems. The project is a joint effort of two DOE programs: the Office of Science Advanced Scientific Computing Research Program and the National Nuclear Security Administration Advanced Simulation and Computing Program. ECP’s RD&D activities are carried out by over 100 teams of scientists and engineers from the DOE national laboratories, universities, and U.S. industry. These teams have been working together since the fall of 2016 on the development of applications, software technologies, and hardware technologies and architectures: Applications: Creating or enhancing the predictive capability of applications through algorithmic and software advances via co-design centers; targeted development of requirements-based models, algorithms, and methods; systematic improvement of exascale system readiness and utilization; and demonstration and assessment of effective software integration. Software Technologies: Developing and delivering a vertically integrated software stack containing advanced mathematical libraries, extreme-scale programming environments, development tools, visualization libraries, and the software infrastructure to support large-scale data management and data science for science and security applications. Hardware and Integration: Supporting U.S. HPC vendor R&D focused on innovative architectures for competitive exascale system designs; objectively evaluating hardware designs; deploying an integrated and continuously tested exascale software ecosystem at DOE HPC facilities; accelerating application readiness on targeted exascale architectures; and training on key ECP technologies to accelerate the software development cycle and optimize productivity of application and software developers. Illustrative examples will be given on how the ECP teams are delivering in these three areas of technical focus, with specific emphasis on recent work on the world’s #1 supercomputer, namely the Summit leadership system at Oak Ridge National Laboratory. |
Mark Moran is the Director of the John Deere Technology Innovation Center in Research Park at the University of Illinois, where he leads a team focused on digital innovation including UX, AI/ML, IoT, Mobile, Cloud and Autonomy. Mark has a BS in Engineering from the University of Illinois, an MBA from the University of Iowa, and SM in System Design and Management from MIT. He is currently pursuing his PhD at the University of Illinois, focusing on Cognitive Science and Natural Language Processing. Mark and his wife Lori have two college-aged daughters and four dogs.
Rosa M. Badia holds a PhD in Computer Science (1994) from the Technical University of Catalonia (UPC).
She is the manager of the Workflows and Distributed Computing research group at the Barcelona Supercomputing Center (BSC).
Her current research interests are programming models for complex platforms (from edge and fog to clouds and large HPC systems). The group led by Dr. Badia has been developing the StarSs programming model for more than 10 years, with high success in adoption by application developers. Currently the group focuses its efforts on PyCOMPSs/COMPSs, an instance of the programming model for distributed computing, including clouds. The group is extending the model to be able to consider edge devices that offload computing to the fog and to the cloud.
Dr. Badia has published nearly 200 papers in international conferences and journals in the topics of her research. Her group is very active in projects funded by the European Commission and in contracts with industry.
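As a side note on the PyCOMPSs model mentioned above, a minimal task sketch might look as follows (assuming the standard @task decorator and compss_wait_on from the PyCOMPSs API; running it requires the COMPSs runtime, e.g. launched via runcompss):

    from pycompss.api.task import task            # PyCOMPSs task decorator
    from pycompss.api.api import compss_wait_on   # synchronization on task results

    @task(returns=1)
    def increment(value):
        # Each call becomes an asynchronous task scheduled by the COMPSs runtime
        return value + 1

    def main():
        partials = [increment(i) for i in range(10)]   # futures, executed in parallel
        results = compss_wait_on(partials)             # block until all tasks finish
        print(sum(results))

    if __name__ == "__main__":
        main()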
Douglas B. Kothe (Doug) has over three decades of experience in conducting and leading applied R&D in computational science applications designed to simulate complex physical phenomena in the energy, defense, and manufacturing sectors. Doug is currently the Director of the U.S. Department of Energy (DOE) Exascale Computing Project (ECP). Prior to that, he was Deputy Associate Laboratory Director of the Computing and Computational Sciences Directorate (CCSD) at Oak Ridge National Laboratory (ORNL). Other prior positions for Doug at ORNL, where he has been since 2006, include Director of the Consortium for Advanced Simulation of Light Water Reactors, DOE’s first Energy Innovation Hub (2010-2015), and Director of Science at the National Center for Computational Sciences (2006-2010).
Before coming to ORNL, Doug spent 20 years at Los Alamos National Laboratory, where he held a number of technical and line and program management positions, with a common theme being the development and application of modeling and simulation technologies targeting multi-physics phenomena characterized in part by the presence of compressible or incompressible interfacial fluid flow. Doug also spent one year at Lawrence Livermore National Laboratory in the late 1980s as a physicist in defense sciences.
Doug holds a Bachelor of Science in Chemical Engineering from the University of Missouri – Columbia (1983) and a Master of Science (1986) and Doctor of Philosophy (1987) in Nuclear Engineering from Purdue University.
Project: Optimization of Fault-Tolerance Strategies for Workflow Applications
This work compares the performance of different approaches to tolerate failures for applications executing on large-scale failure-prone platforms.
Project: Effective Use of Lossy Compression for Numerical Linear Algebra Resilience and Performance
Fault-tolerance is one of the major challenges of extreme-scale computing. There is broad consensus that future leadership-class machines will exhibit a substantially reduced mean-time-between-failure (MTBF) compared to today's systems, because of the expected increase in the number of components without a commensurate improvement in per-component reliability. The resilience challenge at scale is best summarized as follows: faults and failures are likely to become the norm rather than the exception, and any simulation run will be compromised without the inclusion of resilience techniques in the underlying software stack and system. Therefore, we investigate recovery approaches for partially lost data in iterative solvers using compressed checkpoints and data reconstruction techniques.
Multigrid methods are optimal iterative solvers for elliptic problems and are widely used as preconditioners or even standalone solvers. Multigrid methods provide the opportunity to use the underlying hierarchy to create compressed checkpoints in an intuitive way by restricting the iterative solution to coarser levels. A second lossy compression technique uses SZ, a floating-point lossy data compressor designed for scientific data, which has been shown to improve classical checkpointing approaches for the recovery of time-marching systems. SZ compression, unlike multigrid compression, allows users to prescribe accuracy targets and is thereby more easily adaptable to the needs of the iterative solver. Using these two compression techniques, we evaluate their usability for restoring partially lost data of the iterative approximation during a linear solve, e.g. due to node losses. We compare recovery approaches ranging from simple value-wise replacement without post-processing up to solving local auxiliary problems, and assess their efficiency. For the efficiency evaluation, we focus mainly on compression rate, numerical overhead, and necessary communication. Furthermore, we investigate the checkpoint frequency. Preliminary results show that there is no "Swiss Army Knife" which is optimal in all cases; it is therefore inevitable to adapt the methods with respect to the problem and the state of the iterative solver.
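As a rough 1D illustration of the multigrid-based compression idea (a toy sketch, not the actual implementation): the fine-grid iterate is checkpointed on a coarser grid, and after a data loss it is prolongated back by interpolation to serve as the recovery value.

    import numpy as np

    n = 257                                    # fine-grid points
    x = np.linspace(0.0, 1.0, n)
    u = np.sin(2.0 * np.pi * x)                # stand-in for the current iterate

    # "Compressed checkpoint": keep only every second value (coarse-grid restriction)
    x_coarse, u_checkpoint = x[::2], u[::2]    # roughly 2x compression per level

    # Recovery after a node loss: prolongate the checkpoint back to the fine grid
    u_recovered = np.interp(x, x_coarse, u_checkpoint)
    print(np.abs(u - u_recovered).max())       # reconstruction error of the lossy scheme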
Project: Checkpoint/Restart of/from lossy state
Lossy compression algorithms are effective tools to reduce the size of HPC datasets. As established lossy compressors such as SZ and ZFP evolve, they seek to improve the compression/decompression bandwidth and the compression ratio. As the underlying algorithms of these compressors evolve, the spatial distribution of errors in the compressed data changes, even for the same error bound and error bound type. Recent work has shown that properties of the simulation such as the choice of boundary conditions, PDE properties, and domain geometry significantly impact an application's ability to compute on state from lossy compressed data files. If HPC applications are to compute on data coming from compressed data files, we require an understanding of how the spatial distribution of error changes. This talk explores how the spatial distribution of error, compression/decompression bandwidth, and compression ratio change for HPC datasets from the applications PlasComCM and Nek5000 between various versions of SZ and ZFP. In addition, we explore how the spatial distribution of error impacts the correctness of the applications and the ability to create methodologies to recommend lossy compression error bounds.
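To make the notion of a spatial error distribution concrete, the toy sketch below uses uniform quantization as a stand-in for an error-bounded compressor (SZ and ZFP behave differently, but also honor prescribed bounds) and inspects how the error statistics change with the bound:

    import numpy as np

    def lossy_roundtrip(field, abs_bound):
        # Toy error-bounded "compressor": snap values to a grid of width 2*abs_bound
        step = 2.0 * abs_bound
        return np.round(field / step) * step

    x = np.linspace(0.0, 4.0 * np.pi, 200)
    field = np.sin(x)[:, None] * np.cos(x)[None, :]   # synthetic 2D dataset

    for bound in (1e-1, 1e-2, 1e-3):
        err = field - lossy_roundtrip(field, bound)   # spatial error map
        print(bound, float(np.abs(err).max()), float(err.std()))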
Project: Checkpoint/Restart of/from lossy state
Krylov methods are among the most widely used iterative methods for solving large-scale linear problems. Pipelined variants, which overlap communication and computation, have also been developed in order to handle exascale problems on HPC hardware. A downside of these methods is, however, the large amount of memory required to, among other things, store a basis for the Krylov subspace.
Recently, interest has grown in the use of lossy compression techniques to reduce the I/O footprint of large-scale computations when, for example, out-of-core calculations or checkpointing are used. These lossy compression techniques allow for much higher compression rates than lossless techniques, while at the same time maintaining precise control over the error introduced by the compression algorithm.
On the other hand, it has been shown that Krylov methods allow for some inexactness in the matrix-vector product, so it might be possible to combine them with these lossy compression techniques. In this talk we will explore this idea and show some preliminary results.
This is joint work with Emmanuel Agullo (Inria), Luc Giraud (Inria), and Franck Cappello (ANL).
References:
1. L. Giraud, S. Gratton, and J. Langou. Convergence in backward error of relaxed GMRES. SIAM Journal on Scientific Computing, 29(2):710–728, 2007.
2. V. Simoncini and D. B. Szyld. Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM Journal on Scientific Computing, 25:454–477, 2003.
3. D. Tao, S. Di, and F. Cappello. A Novel Algorithm for Significantly Improving Lossy Compression of Scientific Data Sets. International Parallel and Distributed Processing Symposium (IEEE/ACM IPDPS), 2017.
4. J. van den Eshof and G. L. G. Sleijpen. Inexact Krylov subspace methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 26(1):125–153, 2004.
Project: HPC libraries for solving dense symmetric eigenvalue problems
One of the main purposes of this project is to provide the JLESC community with an overview of the pros and cons of existing software, and a perspective on better adaptation to present and planned computers and accelerators. In this talk, we review GPU eigensolvers and the performance benchmark results of four kernels (cuSolverDnDsyevd, magma_dsyevd, magma_dsyevdx_2stage, and eigen) on a single Titan-V (GV-100) GPU. Also, some preliminary progress reports on ELPA on a GPU cluster will be presented.
Project: Scalability Enhancements to FMM for Molecular Dynamics Simulations
In order to use today's heterogeneous hardware efficiently, intra-node and inter-node parallelization is required.
Application developers can choose between fully integrated PGAS approaches hiding the distributed layout of the memory and MPI+X approaches separating shared and distributed memory communication.
The first is perfectly suited if the communication overhead of the transferred data is small compared to the attached computational effort.
In case more control over the message size and message timing is required, the latter approach provides better tuning possibilities.
However, using message passing directly in the application clutters up the code and mixes the often non-trivial communication algorithm with the application algorithm.
In this talk, we explore how to extend our tasking framework, which is used for a synchronization-critical Fast Multipole Method (FMM), to support distributed memory communication.
The tasking framework is based on C++11 and handles the efficient execution of different computational tasks arising in an FMM workflow.
Tasks are configured at compile-time via template meta-programming and mapped to ready-to-execute queues once all dependencies are met.
We show how this concept can be extended to communication tasks and provide preliminary results.
Project: Fast Integrators for Scalable Quantum Molecular Dynamics
Accurate modeling of excited electron dynamics in real materials via real-time time-dependent density functional theory (RT-TDDFT) relies on efficient integration of the time-dependent Kohn-Sham equations, a large system of non-linear PDEs. Traditional time-stepping algorithms like fourth-order Runge-Kutta require extremely short time steps to control error, leading to high computational cost and limiting the feasible system size and simulation time. We have accelerated the search for more efficient numerical integration schemes by interfacing the massively parallel Qbox/Qb@ll RT-TDDFT code with the PETSc library. In this talk, we compare the accuracy, stability, and overall merit of various time-steppers available within PETSc for RT-TDDFT problems, and we investigate the potential of adaptive time stepping to reduce time to solution.
Project: The ChASE library for large Hermitian eigenvalue problems
One of the most pressing bottlenecks of the original Jena VASP-BSE code is the solution of a large and dense Hermitian eigenvalue problem. While computing the Hamiltonian matrix itself is computationally expensive, there were two main issues linked to the solution of the eigenproblem: 1) the storage and reading of the matrix and 2) the computation of a small set of extremal eigenpairs. Thanks to an intense and fruitful collaboration we were able to restructure and parallelize the reading of the Hamiltonian matrix, and accelerate the computation of the desired eigenpairs making use of the ChASE library. The effort resulted in a tremendous performance improvement, and a drastic increase in strong scalability and parallel efficiency. In addition, the Jena BSE code can now access new regimes of k-point sampling density that lead to very large Hamiltonians and were previously inaccessible. The new results show how the use of a tailored eigensolver can extend the range of the domain simulation parameters paving the way to the study of new physical phenomena.
Project: Sharing and extension of OmpSs2 runtime for XMP 2.0 PGAS task model
Task parallelism is considered one of the most promising parallel programming paradigms for handling large irregular applications and many-core architectures such as the post-K computer. R-CCS is developing a parallel programming model named XcalableMP (XMP), and its recent specification introduced a task parallel programming model for distributed memory systems. In this talk, we will present recent results on the XMP 2.0 task model implemented on top of the OmpSs2 task runtime shared by the Barcelona Supercomputing Center.
Project: Shared Infrastructure for Source Transformation Automatic Differentiation
We discuss the self-adjoint shared memory parallelization (SSMP) and transposed forward-mode algorithmic differentiation (TFMAD) strategies for parallelizing reverse-mode derivative computation. We describe a prototype analysis tool and provide an overview of our plan for incorporating the analyses and transformations into the Tapenade algorithmic differentiation tool.
Project: Reducing Communication in Sparse Iterative and Direct Solvers
The talk will study on-node threading integration with iterative methods, in particular how the use of non-blocking techniques at various scales performs. A threaded-MPI comparison will be included to study improvements of Krylov subspace methods. The benefits for different variants of Krylov methods are not spread equally. Finally, mixing s-step methods with pipelining benefits performance while maintaining stability. A quick note on programmer productivity from using threading and/or non-blocking techniques is also presented.
Project: Reducing Communication in Sparse Iterative and Direct Solvers
Enlarged Krylov subspace methods aim to tackle the poor scalability of classical Krylov subspace methods on large supercomputers by reducing the number of iterations to convergence. This reduction comes at the cost of a large increase in the computation required per iteration compared to that of classical Krylov methods. This talk will discuss the design of scalable solvers based on enlarged Krylov subspace methods, including the introduction of on-node parallelism for utilizing emerging supercomputer architectures, as well as preconditioning techniques to further reduce the iteration count.
Project: Developer tools for porting and tuning parallel applications on extreme-scale parallel systems
Developments in the partners' tools and their interoperability will be reported, along with their use with large-scale parallel applications. In particular, early experience with performance analysis of applications using the JURECA cluster+booster modular supercomputing architecture will be presented.
Project: Deep Memory Hierarchies
We study the feasibility of using Intel’s Processor Event-Based Sampling (PEBS) feature to record memory accesses by sampling at runtime, and study the overhead at scale. We have implemented a custom PEBS driver in the IHK/McKernel lightweight multi-kernel operating system, one of whose advantages is minimal system interference due to the lightweight kernel’s simple design compared to other OS kernels such as Linux.
Project: Towards accurate network utilization forecasting using portable MPI-level monitoring
In this talk we will present a low-level monitoring system implemented within Open MPI and how we leverage it to forecast network usage of the application.
Project: MPI International Survey
The preliminary result of the MPI International Survey will be presented.
Project: Reconfiguring Distributed Storage Systems on HPC infrastructures
Parallel storage systems are becoming an increasing bottleneck for HPC applications as their performance does not scale as fast as that of computing resources. A potential solution to this challenge is to use transient distributed storage systems that are deployed on the compute nodes only during the application's run time. Rescaling such a distributed storage system is a complex operation that can be beneficial in many situations: to adjust a system to its workload, to have a configuration fitted to each phase of a workflow, or to save core hours or energy. However, rescaling a distributed storage system necessarily involves numerous data transfers that are assumed to be too slow to be used at run time.
In this talk we present Pufferbench, a benchmark designed to estimate how long a rescaling operation would last in practice on a given platform.
Implementing an efficient rescaling mechanism in an actual distributed storage system is a bigger challenge. The load on each storage node should be balanced to avoid performance degradation from hotspots. The load, however, is often not directly linked to the amount of data stored per node, a metric that should also be balanced in order to have reliable and efficient rescaling operations. Optimizing both is a challenge that must be overcome to implement rescaling operations in distributed storage systems.
In the second part of this talk, we present early results with Pufferscale, a scheduler that organizes data transfers during rescaling operations and ensures fast and predictable rescaling while balancing the load.
Project: Advancing Chameleon and Grid'5000 testbeds
Chameleon develops and operates a large-scale, deeply reconfigurable platform to support computer science systems research. Resources include a large homogenous partition comprising 600 Intel Haswell nodes as well as smaller clusters representing architectural diversity, including ARMs, Atoms, and low-power Xeons, SkyLakes, and hardware including SDN-enabled programmable switches, Infiniband connectors, large RAM, non-volatile memory, different types of SSDs, GPUs, and FPGAs. Chameleon supports bare metal reconfiguration, including booting from a custom kernel, network stitching and software-defined networking experiments. To date, Chameleon has served 3,000+ users working on 500+ research and education projects.
Chameleon Phase 2 includes a range of new services, including CHI-in-a-box, a packaging of the system allowing others to deploy similar testbeds; a range of new networking capabilities; and, most importantly, new services such as Jupyter integration, new orchestration features, and Experiment Précis, which allow investigators to structure their experiments in a way that supports easy repetition and recording, and meaningful sharing with the experimental community. This talk will provide an overview of the new services added to Chameleon in the last year, focusing on experimental techniques.
Project: Improving the Performance and Energy Efficiency of HPC Applications Using Autonomic Computing Techniques
The behaviour of HPC systems, in terms of performance, power consumption, and thermal distribution, is increasingly hard to predict due to process variations and dynamic processors. Additionally, computing facilities are now interested in limiting computing resources by power/energy budget rather than CPU time. We propose a software-level approach that provides control mechanisms which augment hardware power-limiting features and work across multiple nodes, applying autonomic computing techniques and targeting HPC workloads. We adopt an approach combining Autonomic Computing and Control Theory. We present the identified technical issues, our approach, and first results.
Project: Towards Blob-Based Convergence Between HPC and Big Data
Blobs are arguably a solid storage model in a converging context between HPC and Big Data applications. Although there is a clear mapping between blobs and the storage systems used for batch data processing, streaming data is largely ignored by current state-of-the-art storage systems. Applications requiring the processing of a stream of events have specific requirements that can span in-network data processing, delivery guarantees, or latency. Unfortunately, these features are lacking in HPC environments. In this talk, we discuss the design of a large-scale logging framework based on converged blob storage that helps fill the gap between HPC and Big Data applications relying on streaming data processing.
Project: Extreme-Scale Workflow Tools - Swift, Decaf, Damaris, FlowVR
Workflow systems promise scientists an automated end-to-end path from hypothesis to discovery. Expecting any single workflow system to deliver such a wide range of capabilities is impractical, however. A more practical solution is to compose the end-to-end workflow from more than one system. With this goal in mind, the integration of distributed and in situ workflows is explored, where the result is a hierarchical heterogeneous workflow composed of subworkflows, with different levels of the hierarchy using different programming, execution, and data models. In this talk, we present the results of our investigation together with the lessons learned from our integration, which we hope will increase understanding and motivate further research into heterogeneous workflow composition.
Project: Evaluating high-level programming models for FPGA platforms
Compared to central processing units (CPUs) and graphics processing units (GPUs), which have fixed architectures, field-programmable gate arrays (FPGAs) offer reconfigurability and promising performance and energy efficiency. For these FPGA-based heterogeneous computing systems, programming standards have emerged to facilitate the transformation of algorithms from standard systems to heterogeneous systems. Open Computing Language (OpenCL) is a standard framework for writing programs that execute across various heterogeneous computing platforms. Investigating the characteristics of kernel applications with emerging OpenCL-to-FPGA development flows is important for researchers with little hardware development experience to evaluate and adopt the FPGA-based heterogeneous programming model in a laboratory. In this talk, I will summarize the evaluations and optimizations of OpenCL kernels on an Arria10-based OpenCL FPGA platform. The kernels are derived from streaming, integer- and floating-point-intensive, and proxy applications. The experimental results show that FPGAs are promising heterogeneous computing components for energy-efficient high-performance computing.
Project: Evaluating high-level programming models for FPGA platforms
FPGAs have already demonstrated their acceleration potential for several important workloads, particularly workloads that are new to HPC. However, due to their hardware-oriented nature, no standard FPGA abstraction layers, programming interfaces, or runtime systems exist, which prevents the HPC community from building its own FPGA-based clusters. We are tackling this issue by cultivating an FPGA-HPC community in order to discuss abstraction techniques and standardization, identify benchmarking methods and implementations, converge on common programming models and environments, and so on. We have organized several successful workshops (some co-held with major conferences). We will talk about our community efforts, technical outcomes, and next steps.
Project: Simplified Sustained System performance benchmark
We have been developing a new benchmark metric, the Simplified Sustained System performance benchmark (SSSP). In this talk, I'll introduce some new results on the Intel SKL and the Arm ThunderX2.
Project: Resource Management, Scheduling, and Fault-Tolerance for HPC Workflows
While the use of workflows for HPC is growing, MPI interoperability remains a challenge for workflow management systems. The MPI standard and/or its implementations provide a number of ways to build multiple-programs-multiple-data (MPMD) applications, but these have limited applicability to workflow-like applications. In this presentation, we will update the JLESC community on the status of the MPI Launch feature that is currently accessible as a prototype on clusters and Cray systems. We will focus on two current activities around this prototype, 1) coupling parallel applications for in situ data transfer workflows and 2) the effort for MPI standardization of these features.
Continue execution, or interrupt and launch another task? How to make difficult scheduling decisions.
Initiate collaborations on scheduling stochastic tasks and workflows.
Task parallelism is omnipresent these days; whether in data mining or machine learning, for matrix factorization or even molecular dynamics (MD).
Despite the success of task parallelism on CPUs, there is currently no performant way to exploit the task parallelism of synchronization-critical algorithms on GPUs.
Due to this shortcoming, we aim to develop a tasking approach for GPU architectures.
Our use case is a fast multipole method for MD simulations.
Since the problem size in MD is typically small, we have to target strong scaling.
Hence, the application tends to be latency- and synchronization-critical.
Therefore, offloading as the classical programming model for GPUs is unfeasible.
In this short talk we share our experience with the design and implementation of tasking as an alternative programming model for GPUs using CUDA.
In this short talk, we reveal several pitfalls that occur when implementing a tasking framework for GPUs and hope for vivid discussions about eliminating them. There are a number of open questions regarding, among others, warp-synchronous deadlocks, weak memory consistency, and hierarchical multi-producer multi-consumer queues.
I will attempt to use one talk to discuss 2 issues:
1) The use of Python-based workflows at extreme scale. Based on work in Parsl (http://parsl-project.org/), we know we can run large numbers of tasks at large scale on HPC resources, as well as use the same programming model across multiple resources (HPC systems, clouds, etc.). Here, I am interested in finding ways to collaborate with other workflow system developers, either via common user-facing APIs or by using common resource-facing APIs and underlying libraries, as well as finding potential Parsl users and working with them both to find and fill gaps in Parsl and to improve their applications (a minimal Parsl sketch appears after this list).
2) Software sustainability. While researchers at national labs may be recognized for their software, this is problematic in academia. But in order for software to be sustained over more than the life of one project, the developers and maintainers need recognition aligned with their institution's mechanisms (e.g. hiring, promotion, tenure, etc.) I have been developing the concept of citation of software (https://doi.org/10.7717/peerj-cs.86 & https://www.force11.org/group/software-citation-implementation-working-group) for this purpose, and would like to talk about it, and understand what gaps it has in this extreme-scale environment, as well as collaborating with others interested in exploring it or other software sustainability mechanisms.
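Regarding point 1, a minimal Parsl sketch of the task-parallel Python model is shown below (using the local-threads example configuration; the exact configuration import may differ between Parsl versions, and an HPC executor configuration would be used in practice):

    import parsl
    from parsl import python_app
    from parsl.configs.local_threads import config   # swap for an HPC executor config

    parsl.load(config)

    @python_app
    def monte_carlo(n, seed):
        # Each call is a task; Parsl returns a future immediately
        import random
        random.seed(seed)
        return sum(random.random() ** 2 for _ in range(n))

    futures = [monte_carlo(100_000, s) for s in range(8)]   # run concurrently
    print(sum(f.result() for f in futures))                 # block on the results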
see above
Emerging HPC workflows whose execution times are stochastic and unpredictable pose new challenges and opportunities for scheduling. This work presents alternative algorithms to schedule stochastic jobs with the aim of optimizing the system and/or user level metrics. By leveraging traces of some typical neuroscience applications that exhibit such behavior, we show that traditional HPC schedulers are not suitable for these jobs and we demonstrate the effectiveness of the new scheduling algorithms with significant improvement in achievable performance.
Many open questions remain for scheduling stochastic jobs in HPC, including determining the optimal parallel execution of these jobs, and how to leverage checkpointing/restart to cope with the unpredictable behavior of these jobs. The latter question also connects naturally to the resilience of (either deterministic or stochastic) HPC workloads in faulty execution environments.
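As a toy illustration of why a single conservative walltime request is wasteful for such jobs, the sketch below (a simplified cost model, not the algorithms of this work) compares one worst-case reservation against a sequence of growing reservations:

    import numpy as np

    rng = np.random.default_rng(42)
    runtimes = rng.lognormal(mean=1.0, sigma=0.8, size=10_000)  # hypothetical stochastic walltimes

    def mean_reserved_time(reservations, runtimes):
        # Each reservation is paid in full; requests grow until the job fits.
        total = 0.0
        for t in runtimes:
            for r in reservations:
                total += r
                if t <= r:          # job completed inside this reservation
                    break
        return total / len(runtimes)

    upper = float(runtimes.max())
    print("single worst-case request:", mean_reserved_time([upper], runtimes))
    print("sequence of growing requests:", mean_reserved_time([2, 4, 8, upper], runtimes))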
We consider irregular applications that have irregular patterns in data blocking and distribution, with a form of sparsity (e.g., irregular Cartesian tiling of a 2D matrix with block sparsity). Scheduling computations on these data structures poses problems of load balancing and computational intensity that make overlapping communication with computation even harder. When targeting hybrid distributed large-scale platforms with many GPUs per node, the support of runtime systems to schedule this execution helps but is not sufficient, and the algorithm needs to be adapted to the problem.
What is the proper design of performance modeling and heuristics to help the scheduling? Can we achieve a purely scheduling-based approach (as opposed to a control flow embedded in the algorithm)?
A critical component of any high-level task-based runtime system is the interface it offers. It is critical for two major reasons: users primarily interact with the runtime through it, and the interface determines the constraints on the capabilities the runtime can offer. PaRSEC is an example of a high-level task-based runtime system that offers multiple interfaces to developers. One such interface is Dynamic Task Discovery, which allows PaRSEC to support applications exhibiting a dynamic nature by building the task graph at runtime. In this talk we will present the scalability issue this interface suffers from and a way to address it.
How to efficiently build a task graph with minimal information from the users?
PaRSEC is a distributed task-based runtime system that transparently manages the communication between nodes and across heterogeneous devices. By adopting asynchronous operations, we reduce runtime overhead and thus improve application performance. The Future construct has proven to be an effective asynchrony provider in many programming languages and runtime systems. In this talk I will present our current design for the usage of Futures, preliminary results on some benchmarks, and the ongoing PaRSEC optimization work using Futures. We believe that our generic approach can be broadly adopted by other systems.
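As a generic illustration of the Future construct (Python's concurrent.futures here, not PaRSEC's internal futures), work is submitted asynchronously and only waited on when the result is actually needed:

    import concurrent.futures as cf
    import math

    with cf.ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately with a future; work proceeds in the background
        futures = [pool.submit(math.factorial, n) for n in range(2000, 2008)]
        # ... other computation or communication could overlap here ...
        digits = [len(str(f.result())) for f in futures]   # block only on demand
    print(digits)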
Future life cycle management and its potential use cases.
In this work, we are interested in modeling HPC applications for in situ analytics. The objective is to design robust application models for in situ analytics. Those models would give users a resource partitioning of the system as well as a schedule of application tasks. The goal is to improve application performance and reduce overall cost by providing efficient resource usage for an application running on a targeted platform.
Starting from a first model, we now want to evaluate it on real-world application setups to determine 1) the efficiency of our resource partitioning vs. "by-hand" resource partitioning and 2) the efficiency of some proposed scheduling policies. The results will provide us with new perspectives for future model enhancements.
OPEN QUESTIONS:
1) From a theoretical perspective, what is an HPC application, and how can we improve the platform representation (burst buffers, NVRAM usage, etc.)? This includes the needs in terms of in situ analytics (task modeling, data locality, dependencies between analyses, etc.) as well as the general platform model.
2) How to design effective scheduling policies for analysis tasks? Can we guarantee a performance bound?
3) What are the strategies to deploy in order to evaluate such theoretical models on real-world applications?
4) In general, what is the community expectation about such a work?
COLLABORATION OPPORTUNITIES:
A) We would like to test our models on real applications. Any collaboration for this purpose will be welcomed. We need experimental setup and applications to evaluate our models.
B) Any collaboration on the modeling part will be highly welcomed, including discussions on scheduling and storage issues.
Many problems remain to be solved in this work, and any collaboration or discussion on this work will be appreciated during the workshop.
Big Data applications are increasingly moving from batch-oriented execution models to stream-based models that enable them to extract valuable insights close to real time. To support this model, an essential part of the stream processing pipeline is data ingestion, i.e., the collection of data from various sources (sensors, NoSQL stores, filesystems, etc.) and their delivery for processing. Data ingestion needs to support high throughput and low latency and must scale to a large number of both data producers and consumers. Since the overall performance of the whole stream processing pipeline is limited by that of the ingestion phase, it is critical to satisfy these performance goals. However, state-of-the-art data ingestion systems such as Apache Kafka build on static stream partitioning and offset-based record access, trading performance for design simplicity. In this talk we introduce KerA, a data ingestion framework that alleviates the limitations of the state of the art thanks to a dynamic partitioning scheme and to lightweight indexing, thereby improving throughput, latency, and scalability. Experimental evaluations show that KerA outperforms Kafka by up to 4x for ingestion throughput and up to 5x for overall stream processing throughput. Furthermore, they show that KerA is capable of delivering data fast enough to saturate the big data engine acting as the consumer.
How can approaches for scalable stream processing such as KerA be combined with in situ/in transit data processing architectures to address the needs of application scenarios combining HPC and data analytics?
More on resilience techniques.
Collaborate on resilience and scheduling problems.
In this presentation we will introduce different optimal algorithms for scheduling adjoint chains and adjoint multi-chains on general multi-level memory architectures. An extension to more general graphs is needed.
An extension of the presented algorithms to general adjoint graphs could be a collaboration opportunity. In some contexts, the memory storing the checkpoints can fail, and the need arises to design resilient versions of the presented algorithms.
Contrary to post-hoc MD data analytics that uses centralized data analysis (i.e., first generates and saves all the trajectory data to storage and then relies on post-simulation analysis), we extract MD data from the simulation as they are generated, analyze the data, and annotate MD outputs to drive the next steps in increasingly complex MD workflows. We extract data from the simulation by augmenting Plumed, a widely used open source MD analysis library that is designed to be a universal plugin for several popular MD codes. We integrate simulation and analytics on top of DataSpaces, an open source library that provides a shared-space abstraction for simulation and analytics applications to share data based on a tuple-space model. This strategy allows us to run structural analysis of MD frames agnostically, without modifications to the MD, data management (middleware), and analytics.
The modular design of our software framework allows us to mix and match different types of MD simulations, middleware, and analyses to build complex workflows. A near-optimal setting for simulation, middleware, and analytics can minimize data movement, execution time, and energy usage. However, such a setting needs to be determined. This talk aims to be a platform to brainstorm techniques and methodologies for the search for such a setting, independently of the type of molecular systems our MD simulations are studying.
1) Even though Plumed makes our in-situ analytics method broadly applicable to a variety of MD codes, the list of currently supported MD codes is not exhaustive. Are there other methods to extract data from MD simulations without recompiling the MD source code when data brokers such as Plumed are non-viable because of licensing or simply unavailable? In fact, we will kill two birds with one stone if an alternate approach works without a data broker, making our methods even more broadly applicable to other simulation paradigms which share a similar workflow, and differs only in the data schema.
2) Given a type of MD simulation, middleware, and analytics, is it possible to predict the near-optimal parameters using the performance data generated from this work?
3) Our work currently focuses on execution time and memory usage as performance metrics. Lack of fine grain control over energy measurement has limited our ability to use energy usage as a performance metric. How can we measure the energy usage associated with inter-node or intra-node data movement or in-situ analysis processes that may be operating on dedicated cores?
The Holistic Measurement Driven System Assessment (HMDSA) project is designed to enable maximum science production for large-scale high performance computing (HPC) facilities, independent of major component vendor, and within budget constraints of money, space, and power. We accomplish this through development and deployment of scalable, platform-independent, open-source tools and techniques for monitoring, coupled with runtime analysis and feedback, which enables highly efficient HPC system operation and usage and also informs future system improvements.
We take a holistic approach through
* Monitoring of all performance impacting information (e.g., atmospheric conditions, physical plant, HPC system components, application resource utilization) with flexible frameworks for variable fidelity data collection to minimize application and/or system overhead,
* Developing scalable storage, retrieval, and run-time analytics to provide identification of performance impacting behaviors,
* Developing feedback and problem (e.g., faults, resource depletion, contention) mitigation strategies and mechanisms targeting applications, system software, hardware, and users.
How to handle immense amounts of data, flexible storage and access methods, application of intelligent agents to help assess system and application performance without needing to explicitly profile and/or instrument applications.
HPC systems have seen tremendous improvements in computational power, advances unmatched by communication capabilities, leading to a substantial imbalance between computation and communication. Thus, data movement has become a new bottleneck in some large-scale applications. With this in mind, it is essential for communication engines such as MPI to perform as efficiently as possible. This poster focuses on the study of, and various improvements to, multithreaded communication and collective operations in Open MPI, and how they perform on modern HPC systems.
What are the advantages of the new scheme, how difficult is it to adopt, and what kinds of applications will benefit from it?
I will present an implementation of GPU convolution that favors coalesced accesses. Convolutions are the core operation of deep learning applications based on convolutional neural networks. Current GPU architectures are typically used for training deep CNNs, but some state-of-the-art implementations are inefficient for some commonly used network configurations. I will discuss experiments that used our new implementation, which yielded notable performance improvements — including up to 2.29X speedups — in a wide range of common CNN configurations.
I will be asking for potential collaborators willing to try our implementations in their CNNs.
I will also be interested in hearing about CNN architectures experiencing poor performance, to devise new optimization opportunities.
The presentation will give an overview of data-sparse problems and sparse formats in the context of graphs, databases, and FEM methods, together with the algorithm design and performance engineering of sparse operations on GPUs with the Ginkgo open-source linear algebra library and its software sustainability aspects.
Optimal data-sparse methods and unification of sparse formats across graphs, databases, and FEM? Productivity aspects in algorithm design for optimal performance when engineering sparse operations on GPUs? Unified GPU programming solutions between CUDA, OpenMP, and OpenACC? Is CUDA Managed Memory suitable for sparse solvers?
Classification schemes for Polar Stratospheric Clouds will be presented. Methods used for feature reduction include autoencoders and kernel PCA. A comparison with previous results will be provided to assess the prediction performance.
Numerical computation efficiency of the algorithm: what can be done?
High-end computing relies increasingly on machines with large numbers of GPUs. At the same time, GPU kernels are commonly produced in the process of automated software tuning. The Benchtesting OpeN Software Autotuning Infrastructure (BONSAI) project provides a software infrastructure for deploying large performance tuning sweeps to supercomputers. BONSAI allows for parallel compilation and benchmarking of a large number of GPU kernels by dynamically scheduling work to a large number of distributed-memory nodes, each with multiple GPU accelerators. In this talk we outline the BONSAI design and highlight its main capabilities.
collaboration opportunities:
* GPU kernel development - use BONSAI for making large GPU kernel autotuning sweeps
* machine learning - use large datasets produced by BONSAI tuning sweeps
Within the Helmholtz Analytics Framework there is a need for the analysis of data from large to extreme sizes. The current tools for this are run on either high performance computing (HPC) systems or graphics processing unit (GPU) based systems. While there are tools intended for the use of both of these systems concurrently, their communication methods are not designed for HPC environments. The goal of the Helmholtz Analytics Toolkit (HeAT) is to fill this gap.
The HeAT framework is being developed to use both HPC systems and GPUs, with MPI communication, to analyze extremely large datasets. It is built on the concept of using multiple linked PyTorch tensor objects on multiple nodes to distribute the data as well as the computations. In the future, the HeAT framework will include multiple machine learning algorithms as well as multiple traditional data analysis tools.
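A rough sketch of the underlying idea, with each rank holding one PyTorch slab of the data and reductions done over MPI (plain mpi4py + torch here, not the HeAT API), run with e.g. mpirun -np 4 python sketch.py:

    import numpy as np
    import torch
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Hypothetical global dataset split along dimension 0 across ranks
    n_local, n_features = 100_000, 8
    local = torch.randn(n_local, n_features)     # each rank holds one slab

    # Global column means from per-rank partial sums (a single MPI reduction)
    local_sum = local.sum(dim=0).numpy()
    global_sum = np.empty_like(local_sum)
    comm.Allreduce(local_sum, global_sum, op=MPI.SUM)
    global_mean = torch.from_numpy(global_sum) / (n_local * size)

    if rank == 0:
        print(global_mean)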
What are some possibilities for efficient eigenvalue solvers for distributed datasets?
The data is not split into blocks here but rather split along one dimension; can a new parallel matrix multiplication method be devised for this setting that is less communication-intensive than the more traditional approaches?
DeepHyper is a Python package that comprises two components: 1) neural architecture search, an approach for automatically searching for high-performing deep neural network architectures, and 2) hyperparameter search, an approach for automatically searching for high-performing hyperparameters for a given deep neural network. DeepHyper provides an infrastructure that targets experimental research in neural architecture and hyperparameter search methods, scalability, and portability across HPC systems. It comprises three modules: benchmarks, a collection of extensible and diverse DL hyperparameter search problems; search, a set of search algorithms for DL hyperparameter search; and evaluators, a common interface for evaluating hyperparameter configurations on HPC platforms.
Machine learning applications
The talk will discuss the ongoing work at Jülich that aims at bridging HPC, CFD, and Rhinology to realize HPC-supported personalized medicine.
In-situ computational steering; Interactive supercomputing; Realistic real-time CFD simulations
We will present MagmaDNN, a high-performance data analytics library for manycore GPUs and CPUs. MagmaDNN is a collection of high-performance linear algebra routines for deep neural network computations and data analytics.
MagmaDNN is open source, so we are looking for collaborators on its development. Is this just another DNN library, or can we bring more to it in terms of HPC? Current frameworks are not that well optimized, while the core of the computations needed for AI, DNNs, and big-data analytics is linear algebra. As experts in the field, can we collect the routines needed in a framework and find applications to use it? Besides development and high performance, open research questions include how to design DNNs, tune hyperparameters, accelerate solvers, make sense of how they work, etc.
The objective of the Software for Linear Algebra Targeting Exascale (SLATE) project is to provide fundamental dense linear algebra capabilities to the US Department of Energy and to the high-performance computing (HPC) community at large, and ultimately to replace the venerable Scalable Linear Algebra PACKage (ScaLAPACK). SLATE is being developed from the ground up, with focus on scalability and support for hardware accelerators. This talk highlights SLATE's main design principles.
collaboration opportunities:
* science apps - implements building blocks for many science apps
* benchmarking - contains a GPU-accelerated implementation of the HPL benchmark
* scheduling research - many routines make good mini-apps for scheduling research
We will present some recently developed mixed-precision solvers that use FP16 arithmetic. Current hardware, e.g., GPUs with Tensor Cores, has started to support hardware-accelerated low-precision arithmetic for use in AI applications. This hardware is readily available in current extreme-scale systems (like Summit), and it is of interest to use it in general solvers and applications, beyond deep learning networks.
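A minimal NumPy sketch of the underlying idea, iterative refinement with a low-precision inner solver (float32 stands in for FP16 here, since NumPy's LAPACK interface does not factorize half precision, and a real implementation would reuse a single low-precision factorization):

    import numpy as np

    def mixed_precision_solve(A, b, tol=1e-12, max_iter=20):
        A_lo = A.astype(np.float32)                  # low-precision copy (FP16 stand-in)
        x = np.linalg.solve(A_lo, b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            r = b - A @ x                            # residual in double precision
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            d = np.linalg.solve(A_lo, r.astype(np.float32)).astype(np.float64)
            x += d                                   # correction from the cheap solve
        return x

    rng = np.random.default_rng(0)
    n = 500
    A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
    b = rng.standard_normal(n)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))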
Are there other mixed-precision algorithms of interest that can be collected in a mixed-precision numerical library? What applications need and can benefit from mixed-precision computations? Collaboration is needed for the development of software and research questions related to accuracy and performance.
The explosion of hardware parallelism inside a single node calls for a shift in programming paradigms and for disruptively different algorithm designs that exploit the compute power available in new hardware technology. We propose a parallel algorithm for computing a threshold incomplete LU (ILU) factorization. The main idea is to interleave an element-parallel fixed-point iteration that approximates an incomplete factorization for a given sparsity pattern with a procedure that adjusts the pattern to the problem characteristics. We describe and test a strategy for identifying nonzeros to be added to and nonzeros to be removed from the sparsity pattern. The resulting pattern may be different from, and more effective than, that of existing threshold ILU algorithms. Also, in contrast to other parallel threshold ILU algorithms, much of the new algorithm has fine-grained parallelism.
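A dense NumPy sketch of the element-parallel fixed-point idea for a fixed sparsity pattern (Chow/Patel-style sweeps; the actual algorithm additionally adapts the pattern and operates on sparse data structures in parallel):

    import numpy as np

    def fixed_point_ilu_sweeps(A, pattern, sweeps=5):
        # All entries inside the pattern are updated independently per sweep
        n = A.shape[0]
        L = np.tril(np.where(pattern, A, 0.0), -1)
        U = np.triu(np.where(pattern, A, 0.0))
        np.fill_diagonal(L, 1.0)                      # unit lower-triangular factor
        for _ in range(sweeps):
            L_new, U_new = L.copy(), U.copy()
            for i in range(n):
                for j in range(n):
                    if not pattern[i, j]:
                        continue
                    if i > j:                         # lower-triangular entry
                        L_new[i, j] = (A[i, j] - L[i, :j] @ U[:j, j]) / U[j, j]
                    else:                             # upper-triangular entry (incl. diagonal)
                        U_new[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
            L, U = L_new, U_new
        return L, U

    rng = np.random.default_rng(1)
    n = 6
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    pattern = np.abs(A) > 0.5                         # example sparsity pattern
    np.fill_diagonal(pattern, True)                   # always keep the diagonal
    L, U = fixed_point_ilu_sweeps(A, pattern)
    print(np.abs(L @ U - A)[pattern].max())           # mismatch on the pattern shrinks with sweeps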
Optimal precision for FEM? Sparsity pattern exploitation and its limits? Optimal variant of ILU schemes?
Over the last few years, we have observed a growing mismatch between the arithmetic performance of processors, in terms of the number of floating point operations per second (FLOPS), on the one side, and the memory performance, in terms of how fast data can be brought into the computational elements (memory bandwidth), on the other side. As a result, more and more applications can utilize only a fraction of the available compute power as they are waiting for the required data. With memory operations being the primary energy consumer, data access is also pivotal for the resource balance and the battery life of mobile devices. In this talk we will introduce a disruptive paradigm change with respect to how scientific data is stored and processed in computing applications. The goal is to 1) radically decouple the data storage format from the processing format; 2) design a "modular precision ecosystem" that allows for more flexibility in terms of customized data access; and 3) develop algorithms and applications that dynamically adapt data access accuracy to the numerical requirements.
Optimal precision for iterative solvers? A priori and a posteriori precision selection.
Mesh partitioning is a potential bottleneck on extreme-scale systems. Moreover, static partitions evaluated a priori can become inefficient on large and heterogeneous systems with rather unpredictable performance, especially for complex simulations with heterogeneity in the discretization or in the physics evolution across the domain. We have been working on dynamic mesh partition optimization and have tested it on different heterogeneous systems such as Piz Daint from CSCS and the CTE P9 MareNostrum IV cluster from BSC. We aim to share our experience, and we seek collaboration on the solution of different issues regarding parallel partitioning, load balancing, and the performance implications of the partition type (graph-based vs. geometry-based) on the different phases of the simulation code.
We are interested in scalable mesh partitioning; comparison of different partitioners; definition of convenient partitions in heterogeneous systems; the trade-off between load balancing and communication reduction; and the impact of the partition on the linear solvers.
The contributions of this short talk are in the growing high performance computing (HPC) field of data analytics and lie at the intersection of the empirical collection of performance results and a rigorous, reproducible methodology for their collection. Our work on characterizing power and performance in data-intensive applications using MapReduce over MPI expands traditional metrics such as execution time to include metrics such as the power and energy usage associated with data analytics and data management. We move away from traditional compute-intensive workflows towards data-intensive workloads, with a focus on MapReduce programming models as they gain momentum in the HPC community. The talk focuses on the quantitative evaluation of performance and power usage over time in data-intensive applications that use MapReduce over MPI. We identify ideal conditions for the execution of our mini-applications in terms of (1) dataset characteristics (e.g., unique words in datasets); (2) system characteristics (e.g., KNL and KNM); and (3) the implementation of the MapReduce programming model (e.g., the impact of various optimizations). Preliminary results presented in the talk illustrate the high power utilization and runtime costs of data management on HPC architectures.
Open questions we would like to discuss at the workshops are as follows:
(1) how far are our observations from a general principle relating power cap and performance in data-intensive applications?
(2) is there any way of reducing data movement (and power usage) other than the combiner techniques?
(3) how can we tune the settings of the underlying MapReduce framework during runtime to extend identified "sweet spot" regions (i.e., regions of minimum runtime and power usage)?
The EU H2020 Centre of Excellence POP (Performance Optimisation and Productivity), with both BSC and JSC as partners, received additional funding to operate for another 3 years (Dec 2018 to Nov 2021). We will quickly highlight important new aspects of the CoE relevant to JLESC, namely the work on standard performance assessment metrics and an associated methodology, and the co-design data repository.
Contributions/collaboration on both the standard performance assessment metrics/methodology and the co-design data repository.
Modern CPUs offer a plethora of different native events for monitoring hardware behavior. Some map readily to concepts that are easily understood by performance analysts. Many others however, involve esoteric micro-architectural details, making it difficult even for performance experts to fully understand how to take advantage of them, or which ones to use for measuring desired behaviors and exposing pathological cases. In this talk we will outline our work that aims to shed light on these obscure corners of performance analysis.
Which part of the architecture do you think is the most important to monitor in order to assess the performance of your code, and which events would you use to do so?
In this talk, I will be discussing the capabilities, usage, and application of Software-based Performance Counters (SPCs) within Open MPI. These SPCs expose otherwise inaccessible internal Open MPI metrics through performance variables in the MPI Tools Information Interface. Enabling SPCs in Open MPI adds minimal overhead to MPI applications and can provide lower level information than the existing user-level PMPI interface. I will illustrate how these counters allow users to identify performance bottlenecks and areas for improvement in both user code and the MPI implementation itself.
This work lends itself to collaboration with performance tool developers and application developers looking to analyze the performance of the MPI portions of their code.
In this talk, I will present our design space exploration study that considers the most relevant architecture design trends we are observing today in HPC systems. I will discuss performance and power trade-offs when targeting different HPC workloads, and my take on the main issues for the advance of academic research in this area.
The design space of HPC architectures is widening: accelerators, high number of cores and nodes, new memory technologies. How do we integrate all the necessary pieces to simulate such systems at reasonable speed and accuracy?
We are exploring next-generation high performance computing architectures for the post-Moore era. We are also trying to develop a methodology to estimate and analyze the performance of such future-generation architectures.
Development of Custom Computing Systems with state-of-the-art FPGAs
What kind of computing kernels and communication/synchronization should we offload to a tightly-coupled FPGA cluster? We are looking for such killer computing and communication for FPGAs!
Exascale and beyond, architectures are expected to become more complex, exhibiting a deepening memory hierarchy and more complex topologies.
Executing HPC applications on such platforms will require careful allocation of resources. In particular, advanced workloads like ensemble simulations or coupled codes might also want to use either co-scheduling or dynamic resource management for better efficiency.
To address this challenge, we are exploring the use of dynamic resource control schemes to map complex workloads onto a node, using performance monitoring to dynamically redistribute resources among components. As the resource partitioning and allocation scheme, we use a special kind of container that we call "resource slices".
These slices do not perform namespacing or virtualization, but only take care of resource control, using the available operating system interfaces on Linux (cgroups, resctrl). So far, we can perform CPU and NUMA node allocation, as well as cache bandwidth control.
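For illustration only (this is not the actual resource-slice implementation), the snippet below shows the kind of Linux cgroup-v1 cpuset manipulation such a slice could rely on to pin a component to a CPU set and a NUMA node; the slice path and values are assumptions.

```cpp
#include <fstream>
#include <string>
#include <iostream>
#include <unistd.h>

// Hypothetical example: confine this process to CPUs 0-15 on NUMA node 0
// using the Linux cgroup-v1 cpuset controller. The slice directory is made up
// and must exist; a real slice manager would also handle cleanup and drive
// resctrl for cache/bandwidth control.
static bool write_file(const std::string &path, const std::string &value) {
    std::ofstream f(path);
    if (!f) return false;
    f << value;
    return static_cast<bool>(f);
}

int main() {
    const std::string slice = "/sys/fs/cgroup/cpuset/slice0";  // assumed path
    bool ok = write_file(slice + "/cpuset.cpus", "0-15")       // CPU allocation
           && write_file(slice + "/cpuset.mems", "0")          // NUMA node allocation
           && write_file(slice + "/tasks", std::to_string(getpid()));  // join the slice
    std::cout << (ok ? "slice configured" : "failed (permissions? missing dir?)") << "\n";
    return ok ? 0 : 1;
}
```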
While we have some early users with ensemble workloads, we are looking for more collaborators willing to explore:
- other types of workloads,
- workload-specific resource allocation schemes,
- extensions of our resource slices to other resources.
This work explores the potential usage of hybrid Cloud+Edge+HPC architectures for data analytics. We focus on the case of Pl@ntNet, an application that uses ML to identify and classify plants based on their pictures. Users have an active role in classification, as they need to validate the result proposed by the classification algorithms, while new expert-validated recognitions serve to continuously improve the underlying training mechanism. We propose an approach based on decentralizing part of the computation necessary to classify plants from central processing (cloud) to the edge of the infrastructure. Namely, we investigate how local feature extraction and local fake-image detection could improve the overall application performance. In addition, we tackle the problem of reducing the time taken to train the neural network used by the classifier, so that new plant species identified by the users can be added to the ML model faster using HPC capabilities.
There are 2 open questions:
1- What are the gains in terms of performance (classification latency) and data transmission costs obtained by performing part of the classification locally?
2- How resource-consuming is neural network continuous learning? Would it be possible to perform part of the processing on the Edge of the infrastructure?
Over the last decades, distributing high-performance applications has been made easier thanks to outstanding advances in programming languages, runtime systems, parallel libraries, and load balancing algorithms. But even if it has been made easier, it still remains, for the most part, a nightmare. In this talk I will discuss how the problem can be attacked, learning from existing practices and leveraging new programming approaches. In particular, as MPI and OpenMP have proven that standardization of interfaces is a successful approach, I will discuss the possibility of standardizing load balancing abstractions. I will present the implications in terms of software architecture, and how such an effort could benefit the entire HPC community, from application writers to algorithm developers.
How to abstract load balancing from runtime systems and applications? How to express load balancing abstractions as types? How to leverage concept-based programming to design a load balancing library? Could C++ executors be used to provide load balancing capabilities?
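As one speculative reading of "load balancing abstractions as types" (my own sketch, not the speaker's design), a C++20 concept could capture the minimal interface a load-balancing strategy has to satisfy:

```cpp
#include <concepts>
#include <cstddef>
#include <cstdio>
#include <span>
#include <vector>

// Speculative sketch: a concept describing what a load-balancing strategy must
// provide -- given per-task loads and a number of workers, return an assignment
// of tasks to workers. Names and shape are illustrative only.
template <typename LB>
concept LoadBalancer = requires(LB lb, std::span<const double> loads, std::size_t workers) {
    { lb.assign(loads, workers) } -> std::convertible_to<std::vector<std::size_t>>;
};

// A trivial strategy satisfying the concept: round-robin assignment.
struct RoundRobin {
    std::vector<std::size_t> assign(std::span<const double> loads, std::size_t workers) const {
        std::vector<std::size_t> owner(loads.size());
        for (std::size_t i = 0; i < loads.size(); ++i) owner[i] = i % workers;
        return owner;
    }
};
static_assert(LoadBalancer<RoundRobin>);

// Generic code can now be written against the abstraction, not a runtime system.
template <LoadBalancer LB>
std::vector<std::size_t> rebalance(const LB &lb, std::span<const double> loads, std::size_t workers) {
    return lb.assign(loads, workers);
}

int main() {
    const std::vector<double> loads = {1.0, 3.0, 2.0, 5.0, 4.0};
    auto owner = rebalance(RoundRobin{}, loads, 2);
    for (std::size_t t = 0; t < owner.size(); ++t)
        std::printf("task %zu -> worker %zu\n", t, owner[t]);
    return 0;
}
```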
Spatial soil moisture data are relevant to environmental sciences (e.g., ecological niche modeling, carbon monitoring systems, and other Earth system models) and precision agriculture (e.g., optimizing irrigation practices and other land management decisions). The primary source of soil moisture data over large areas is satellite-borne, radar-based remote sensing technology. Though produced with daily measurements, a major downfall of satellite soil moisture datasets is their coarse resolution, often too coarse for local needs (e.g., precision agriculture). To remedy this, we are leveraging machine-learning techniques coupled with other sources of environmental information (e.g., topography or weather) that are related to soil moisture and available at a finer spatial resolution.
Our collaboration between computer scientists at the University of Tennessee and soil scientists at the University of Delaware is developing SOMOSPIE—a modular SOil MOisture SPatial Inference Engine—for generating soil moisture information at finer resolution than available from satellite data. The engine consists of modular stages for processing spatial data, generating models with machine learning techniques, and analyzing the resulting predictions. The initial stage curates the available remotely sensed soil moisture measurements and ancillary environmental data for the desired temporal and geographic region. For the central stage, we are utilizing traditional methods such as k-nearest neighbors and random forest regression, as well as novel techniques such as HYbrid Piecewise POlynomials (HYPPO). Finally, networks of ground sensors provide us with "ground truth" by which to validate both our soil moisture predictions and the methods by which we produce the predictions.
(1) Are there other areas with a similar objective--downscaling one dataset using related data available at a higher resolution--for which existing research efforts and our efforts may be mutually informative?
(2) Aside from computing basic statistics (e.g., correlation) between soil moisture and related variables, what computational methods exist for identifying relationships that can improve model generation?
(3) There are inherent mathematical limitations to predicting soil moisture at a specific point using a summary value for a surrounding area (i.e., a large pixel from coarse, remotely sensed data). Might iterative bootstrapping methods reduce the influence of the coarseness of source data?
We will present challenges and some current results in the development of a high-performance FFT library for large-scale heterogeneous systems. Our goal is to provide a sustainable high-performance FFT library for exascale platforms that leverages the large investments in FFT software by the broader HPC community.
Of particular interest are links to applications: what applications use FFTs, how, and can we use application-specific optimizations while still providing a library with a consistent API? Furthermore, FFT has particular computational motifs that can be used elsewhere. Of interest is high-performance MPI for all-to-all using GPU-direct communications; these are used in global "matrix transpositions". Are there other applications that need these building blocks, and is it of interest to expose them to users with a proper interface?
Determining the presence of “galaxies” in an n-body simulation is usually based on some form of the Friends-of-Friends algorithm, which connects close groups of particles together into single components. This algorithm can be divided into three distinct phases: (a) identifying the connected components, (b) component labelling, and (c) pruning insignificant components. Identifying connected components of a graph is a well-researched problem, and can either be solved by using a graph traversal algorithm or by employing disjoint-set data structures. Distributed parallel versions of Breadth-First Search (BFS) or Depth-First Search (DFS) can be used to efficiently traverse a single component. However, the parallel BFS needs to be executed once per connected component, and these executions cannot be overlapped. We explore the design of a fully distributed asynchronous union-find algorithm where the edge set is processed in parallel across processes without any barriers across them.
As the number of vertices we process for astrophysical simulations is easily a billion plus, the number of union-find operations, and in turn the number of messages in flight, can easily clog the network, hurting performance and in many cases running out of memory. Hence, it is critical to explore an effective way to throttle the generation/identification of edges, e.g. via batching strategies. However, the design of such strategies is challenging because it faces the twin and opposite dangers of starving processors of work or swamping their memories, when the decision criteria are spread over all the processors. We develop effective strategies for this purpose in our work, and present results of this work-in-progress. The algorithm is general-purpose and has applications in domains such as social network analysis and other graph problems. Collaborations with application as well as computer scientists are welcome.
Although single-node shared memory solutions have been explored, as soon as the problem gets large enough not to fit on a single node, it becomes very difficult to get good performance. Some of the challenges include very fine-grained messaging, tradeoff between shared memory and message-based programming costs, load imbalances, and prioritization and scheduling among multiple competing computations in an asynchronous setting.
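For reference, the textbook shared-memory baseline that the distributed asynchronous algorithm generalizes is plain union-find with path compression and union by rank; a minimal sketch:

```cpp
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

// Textbook sequential union-find (disjoint-set) with path halving and union
// by rank; the distributed, asynchronous variant discussed in the talk builds
// on the same find/union semantics.
class UnionFind {
public:
    explicit UnionFind(std::size_t n) : parent_(n), rank_(n, 0) {
        std::iota(parent_.begin(), parent_.end(), 0);
    }
    std::size_t find(std::size_t x) {
        while (parent_[x] != x) {
            parent_[x] = parent_[parent_[x]];  // path halving
            x = parent_[x];
        }
        return x;
    }
    void unite(std::size_t a, std::size_t b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);
        parent_[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }
private:
    std::vector<std::size_t> parent_;
    std::vector<std::uint8_t> rank_;
};

int main() {
    UnionFind uf(6);
    uf.unite(0, 1); uf.unite(1, 2); uf.unite(4, 5);
    std::printf("0 and 2 connected: %d\n", uf.find(0) == uf.find(2));
    std::printf("0 and 4 connected: %d\n", uf.find(0) == uf.find(4));
    return 0;
}
```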
Modular Supercomputing has recently been proposed as an alternative design approach to address open challenges of the current computing cluster paradigm such as power consumption, resiliency, and concurrency. One available machine puts this Modular Supercomputing design approach into practice using a Cluster-Booster setup. The Cluster nodes have fewer cores but higher frequencies and are used for the less scalable parts of an application, while the Booster nodes have a higher core count but lower frequencies and are intended for the more scalable parts of the application. Since Modular Supercomputing may be one of the possible paths towards exascale-enabled HPC systems, it is important to understand which applications would typically benefit from a modular design. In this regard, there are several important open questions: 1) How should one partition an existing parallel application so that each part is executed across distinct modules? 2) Is it possible to predict which part of an application should run on which module? Further, is it possible to create a feature-based prediction for the partitioning? These features include the scalability, the memory footprint, and the amount of data that needs to be exchanged between different parts of the application. One key element common to all questions is to carry out software optimizations specifically aimed at modular execution that are still performance portable. One of the most important aspects of such optimizations is the performance of the communication between the different modules. Topology-aware MPI collectives are proposed to speed up the communication between modules by using topology information that is available at run time.
- General Guidelines for porting applications for Modular Supercomputing
- Feature-based partitioning prediction
- Optimizing MPI collectives using topology-awareness
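As a minimal illustration of the topology-aware idea mentioned above (an assumption on my part, not the project's actual implementation), MPI communicators can be split along hardware boundaries so that a collective is staged within and then across nodes or modules:

```cpp
#include <mpi.h>
#include <cstdio>

// Sketch: build a per-node communicator and a communicator of node leaders,
// so that an allreduce can be staged as intra-node reduce + inter-node
// allreduce + intra-node broadcast. A module-level split would follow the
// same pattern with module membership as the color instead of
// MPI_COMM_TYPE_SHARED.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    // Leaders (node_rank == 0) form the inter-node communicator.
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    double local = world_rank, node_sum = 0.0, global_sum = 0.0;
    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node_comm);
    if (node_rank == 0)
        MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leader_comm);
    MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node_comm);

    if (world_rank == 0) std::printf("sum = %f\n", global_sum);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```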
As MPI comes to terms with shared memory in the presence of large multicore nodes, one approach for effective parallel programming without resorting to the multi-model programming represented by MPI+X approaches is to support multiple ranks within a single process. This is useful even in conjunction with MPI+X programming, as indicated by the partial support in the community for the “endpoints” proposal. In Adaptive MPI, developed in our research group at Illinois, virtualization and overdecomposition are necessary for the purpose of supporting key features such as dynamic load balancing, malleability and fault tolerance, in addition to adaptive overlapping of communication and computation. The problem then becomes how to support multiple ranks, each of which looks like a logical process to the programmer, in a single physical process, which allows sharing of memory in effective ways among ranks. Additional challenges arise when we wish to allow migration of these ranks across physical hosts. We will enumerate multiple issues and challenges that arise in this context and multiple approaches that are being explored. These include the old isomalloc approach for migration developed in France and in regular use in AMPI, the process-in-process approach currently being explored by Prof. Hori and collaborators, as well as a set of techniques being explored for Adaptive MPI. I will present these and seek ideas, synergies and collaborations among JLESC researchers.
Esoteric system-level issues arise in supporting multiple virtual ranks within a process. There are compiler-level issues that need unified approaches to privatization of a correct subset of global variables. OS support regarding mmap and virtual memory reservations comes up in trying to support migration. We are hoping that a comprehensive review of these challenges may generate solutions and suggestions from the broader JLESC community.
We evaluated the performance and the power consumption for multiple vector lengths on some benchmarks.
We will evaluate the effect of vector length on performance and energy consumption.
Tasks are a good support for composition. During the development of a high-level component model for HPC, we have experimented with managing parallelism from components using OpenMP tasks. Since version 4.0, the standard proposes a model with dependent tasks that seems very attractive because it enables the description of dependencies between tasks generated by different components without breaking maintainability constraints such as separation of concerns. We present our feedback on using OpenMP in this context. We found that our main issues are a task granularity that is too coarse for the expected performance on classical OpenMP runtimes, and a harmful task-throttling heuristic that is counter-productive for our applications. We present a breakdown of the completion time of task management in the Intel OpenMP runtime and propose extensions evaluated on a testbed application coming from the Gysela application in plasma physics.
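For readers unfamiliar with the OpenMP 4.0 dependent-task model referred to above, a minimal generic example (not Gysela-specific) looks as follows; compile with an OpenMP-enabled compiler (e.g., -fopenmp).

```cpp
#include <cstdio>

// Minimal OpenMP 4.0+ dependent tasks: two producer tasks feed a consumer.
// The depend clauses let tasks generated by different components be ordered
// through the data they touch, without explicit synchronization between them.
int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                       // component 1

        #pragma omp task depend(out: b)
        b = 2.0;                       // component 2

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                     // runs only after both producers

        #pragma omp taskwait
    }
    std::printf("c = %f\n", c);
    return 0;
}
```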
OpenMP runtime limitations wrt task granularity. Higher level data flow model than just plain task graphs.
With the objective of extracting as much performance as possible from exascale machines, the traditional mix of paradigms for shared memory, distributed memory and accelerators will struggle to achieve near-peak performance. Dataflow task-based runtime systems are a natural solution: they abstract architecture-specific APIs and remove excessive synchronizations, while proposing different domain-specific interfaces. Programming within such a model can prove to be a challenge, and can even have inherent scalability issues. We will explore the possibility of an easy-to-program middle-ground paradigm that delivers enough information to the runtime to mitigate scaling issues.
Exchange ideas with runtimes and applications on what information can be shared with the domain specific interface to help the runtime improve its scheduling of tasks and communication.
Current communication libraries do not match the needs of asynchronous task models, such as PaRSEC or Legion, and of graph analytic frameworks; also, they cannot take good advantage of forthcoming smart NICs. We present our current design for LCI, a Lightweight Communication Library, and current work to use it to support PaRSEC.
Feedback on the current design will be useful. We have started a (currently unfunded) collaboration with George Bosilca at UTK and propose to make this collaboration a JLESC project.
Starting with the current generation and continuing into the next, compute nodes embed a complex addressable memory hierarchy. In order to extract performance from this memory, it is required to optimize data locality through careful data allocation and well-timed data migrations. Though it is not desirable to expose such complexity to final application developers, building blocks for operating the memory hierarchy will need to be exposed with a consistent and convenient interface to enable portable and efficient runtime optimizations. The AML memory library is being developed as part of the Argo project, funded by the Exascale Computing Project, to define and implement such building blocks. So far, it is envisioned in the library that explicit memory management can be done through three main components:
cross memory/device data migration,
explicit userdata layout and userdata access patterns,
and hardware locality.
While the library is under development and integration into real world applications, I am looking for collaborations to design runtime optimizations enabled by these blocks, such as automatic prefetching or data packing.
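This is not AML's API (which, as noted, is still under development); purely to illustrate the cross-memory data-migration building block mentioned above, standard libnuma/move_pages calls can already migrate pages between NUMA nodes:

```cpp
#include <numa.h>
#include <numaif.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

// Illustrative only: allocate a buffer on NUMA node 0, then migrate its pages
// to node 1 with move_pages(2). A memory library would wrap such primitives
// behind layout- and locality-aware building blocks. Link with -lnuma.
int main() {
    if (numa_available() < 0 || numa_max_node() < 1) {
        std::printf("need libnuma support and at least two NUMA nodes\n");
        return 0;
    }
    const long page = sysconf(_SC_PAGESIZE);
    const std::size_t npages = 64;
    char *buf = static_cast<char *>(numa_alloc_onnode(npages * page, 0));
    for (std::size_t i = 0; i < npages * page; ++i) buf[i] = 1;  // fault pages in

    std::vector<void *> pages(npages);
    std::vector<int> target(npages, 1), status(npages, -1);
    for (std::size_t i = 0; i < npages; ++i) pages[i] = buf + i * page;

    long rc = move_pages(0 /* this process */, npages, pages.data(),
                         target.data(), status.data(), MPOL_MF_MOVE);
    std::printf("move_pages rc=%ld, first page now on node %d\n", rc, status[0]);

    numa_free(buf, npages * page);
    return 0;
}
```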
Hopefully such building blocks will enable more extensive works on existing open questions such as:
** Optimal static allocation under capacity, bandwidth and latency constraints.
** Automatic data migration (to fast scratchpads or closer memory).
** Coupled management of threads and data ...
What levers do we have to encourage collaborations, apart from a convincing speech?
SIMD operations on irregular data structures may be difficult if the computational operations differ between the data elements. But often the compute kernel is repeatedly called for many different inputs and could easily take advantage of SIMD operations if the input is provided in AoSoA format.
We will show how a list of arrays can be `transposed' (either on-the-fly or as a copy operation) into AoSoA format (and vice versa) using SIMD shuffle operations within C++. This approach may be helpful to accelerate some kernels if (permanently) changing the underlying memory layout is not possible or beneficial.
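A hedged sketch of the kind of shuffle-based transpose meant here (SSE intrinsics, not the talk's actual library code): four consecutive 4-float structs are loaded and transposed into SoA registers on the fly.

```cpp
#include <xmmintrin.h>
#include <cstdio>

// Sketch: transpose 4 consecutive {x,y,z,w} structs (AoS) into four SoA
// registers using SSE shuffles, so a kernel can process 4 x-values at once.
// Real AoSoA code would do this per block and support wider SIMD ISAs.
struct Particle { float x, y, z, w; };

int main() {
    Particle p[4] = {{0,1,2,3}, {4,5,6,7}, {8,9,10,11}, {12,13,14,15}};

    __m128 r0 = _mm_loadu_ps(&p[0].x);   // x0 y0 z0 w0
    __m128 r1 = _mm_loadu_ps(&p[1].x);   // x1 y1 z1 w1
    __m128 r2 = _mm_loadu_ps(&p[2].x);
    __m128 r3 = _mm_loadu_ps(&p[3].x);

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // r0 = x0..x3, r1 = y0..y3, ...

    float xs[4];
    _mm_storeu_ps(xs, r0);
    std::printf("x lane: %.0f %.0f %.0f %.0f\n", xs[0], xs[1], xs[2], xs[3]);
    return 0;
}
```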
Which groups are facing similar problems?
Could they benefit from having this available as a separate library?
Which other SIMD architectures are of interest?
In this talk we present the Task-Aware MPI (TAMPI) library that extends the functionality of standard MPI libraries by providing new mechanisms for improving the interoperability between parallel task-based programming models, such as OpenMP or OmpSs-2, and both blocking and non-blocking MPI operations. Following the MPI standard alone, programmers must pay close attention to avoid deadlocks that may occur in hybrid applications (e.g., MPI+OpenMP) where MPI calls take place inside tasks. This is caused by the out-of-order execution of tasks, which consequently alters the execution order of the enclosed MPI calls. The TAMPI library ensures a deadlock-free execution of such hybrid applications by implementing a cooperation mechanism between the MPI library and the parallel task-based runtime system.
TAMPI supports two different modes. The blocking mode targets the efficient and safe execution of blocking MPI operations (e.g., MPI_Recv) from inside tasks, while the non-blocking mode focuses on the efficient execution of non-blocking or immediate MPI operations (e.g., MPI_Irecv), also from inside tasks.
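To make the hazard concrete, the generic pattern below (not TAMPI's API, which is transparent to the application once the library is linked in) issues blocking receives and sends from inside OpenMP tasks; without a mechanism like TAMPI's blocking mode, the out-of-order execution of these tasks can deadlock.

```cpp
#include <mpi.h>
#include <vector>

// Each of two ranks creates one receive task and one send task per chunk.
// A task scheduler is free to run all the receive tasks first; if they occupy
// every thread with a blocking MPI_Recv on both ranks, the send tasks never
// start and the exchange deadlocks. TAMPI's blocking mode suspends the task
// (not the thread) inside MPI_Recv, so the send tasks can still run.
// Sketch only: assumes exactly 2 ranks, no error handling.
int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int peer = 1 - rank;
    const int nchunks = 64;
    std::vector<double> send_buf(nchunks, rank), recv_buf(nchunks, 0.0);

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < nchunks; ++i) {
            #pragma omp task firstprivate(i) shared(recv_buf)
            MPI_Recv(&recv_buf[i], 1, MPI_DOUBLE, peer, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

            #pragma omp task firstprivate(i) shared(send_buf)
            MPI_Send(&send_buf[i], 1, MPI_DOUBLE, peer, i, MPI_COMM_WORLD);
        }
        #pragma omp taskwait
    }

    MPI_Finalize();
    return 0;
}
```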
We are looking for pure or hybrid MPI apps or mini-apps to test our TAMPI library and get feedback to improve it.
Online black-box optimization of the power-performance tradeoff via hardware actuators/sensors and lightweight application instrumentation.
What are the possible gains along the power-performance curve on production applications? Which feedback is useful, and what actuators have the most impact? If sufficient gains are possible, can they be achieved by an online control policy? We are currently working with Intel architectures, which other platforms could such work apply to?
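For context only (this is not the project's software stack), on Linux the Intel RAPL sensors and power caps are already exposed through the powercap sysfs tree, which is the kind of sensor/actuator such a control loop can build on; the path below is the usual location but is system-dependent.

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <chrono>

// Read package energy twice from the Linux powercap (Intel RAPL) sysfs
// interface and report average power over the interval (ignoring counter
// wrap-around). Writing the corresponding power-limit files, the actuator
// side, requires privileges and is omitted here.
static long long read_energy_uj(const std::string &path) {
    std::ifstream f(path);
    long long v = -1;
    f >> v;
    return v;
}

int main() {
    const std::string rapl = "/sys/class/powercap/intel-rapl:0/energy_uj";
    long long e0 = read_energy_uj(rapl);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    long long e1 = read_energy_uj(rapl);
    if (e0 < 0 || e1 < 0) {
        std::cerr << "could not read " << rapl << "\n";
        return 1;
    }
    std::cout << "package power ~ " << (e1 - e0) / 1e6 << " W over 1 s\n";
    return 0;
}
```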
This talk briefly introduces several use cases where checkpointing techniques can help capture critical application data structures for later re-use. It discusses both the specific requirements and constraints that checkpointing has in these use cases, as well as the potential benefits.
What are the key requirements for checkpointing when it is used beyond resilience? How do checkpointing techniques designed for resilience need to change to meet these requirements and optimize performance and scalability? If your group needs checkpointing beyond resilience, the speaker would be happy to learn more about the specific use case and collaboration opportunities.
As supercomputers have grown in scale to meet computing demands, their mean time between failures has declined. MPI is evolving to include concepts that enable continued operation despite failures. A recent push toward non-blocking, configurable, and asynchronous recovery tries to address issues with composing modular recovery procedures and to amortize the MPI repair cost with other recovery aspects. The goal is to enable application components to control the scope and timing at which errors are reported and to permit an asynchronous MPI recovery operation where the recovery procedures of multiple components can overlap. Recent advances have demonstrated the feasibility of the approach, and open a new landscape for resilient algorithms and application demonstrators that can operate in an asynchronous manner.
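As a baseline for what the non-blocking approach generalizes, here is a hedged sketch of classic blocking ULFM-style repair, using the MPIX_* extensions available in ULFM-enabled MPI implementations; the talk targets overlapping exactly this repair step with other recovery work.

```cpp
#include <mpi.h>
#include <mpi-ext.h>   // ULFM extensions (MPIX_*), available in ULFM-enabled MPI
#include <cstdio>

// Classic blocking repair: on a process-failure error, revoke the communicator
// so every rank observes the failure, then shrink it to obtain a working
// communicator of survivors. Sketch only; real codes also restore lost state.
static void repair(MPI_Comm &comm) {
    MPIX_Comm_revoke(comm);             // interrupt pending operations everywhere
    MPI_Comm shrunk;
    MPIX_Comm_shrink(comm, &shrunk);    // collective over the survivors
    MPI_Comm_free(&comm);
    comm = shrunk;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    double x = 1.0, sum = 0.0;
    int rc = MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    int ec = MPI_SUCCESS;
    if (rc != MPI_SUCCESS) MPI_Error_class(rc, &ec);
    if (ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_REVOKED) {
        repair(comm);                   // blocking; the talk aims to overlap this
        MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    }
    std::printf("sum = %f\n", sum);

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```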
Opportunities for developing non-blocking recovery algorithms and global scope error reporting (implementation level) and application use cases.
As HPC systems grow larger and include more hardware components of different types, the system's failure rate becomes higher. Efficient fault tolerance techniques are essential not only to ensure the execution completion but also to save energy. In many cases, failures have a localized scope and their impact is restricted to a subset of the resources being used. In MPI applications, combining Checkpoint/Restart and message logging enables the localized rollback recovery of only the processes affected by a failure, which heavily reduces the recovery overhead. Using MPI remote memory access operations and performing a custom replay of collective operations lowers the synchronicity of the replay and can contribute towards minimizing the overall failure overhead and energy consumption.
The open questions will be focused on how to enable a receiver-driven replay of communications in MPI applications.
Deep neural networks (DNNs) have been quickly and broadly exploited to improve the data analysis quality (such as classification accuracy) in many complex science and engineering applications. Today’s DNNs are becoming deeper and wider because of the increasing demand for analysis quality and the more and more complex applications to solve. The wide and deep DNNs, however, require large amounts of resources (such as memory, storage, and I/O), significantly restricting their utilization on resource-constrained systems. We propose DeepSZ: an accuracy-loss bounded neural network compression framework, which involves four key steps: network pruning, error bound assessment, optimization of the error bound configuration, and compressed model generation, featuring a high compression ratio and low encoding time. Experiments show that DeepSZ can compress AlexNet and VGG-16 on the ImageNet dataset by compression ratios of 46× and 116×, respectively, and compress LeNet-300-100 and LeNet-5 on the MNIST dataset by compression ratios of 57× and 56×, respectively, with only up to 0.3% loss of inference accuracy.
How to leverage lossy compression in different big-data related challenging research issues, such as checkpointing, communication, and storage performance? How to improve lossy compression quality for specific applications?
An in-depth understanding of the failure features of HPC jobs in a supercomputer is critical to the large-scale system maintenance and improvement of the service quality for users. In this paper, we investigate the features of hundreds of thousands of jobs in one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, based on 2001 days of observations with a total of over 32.44 billion core-hours. We study the impact of the system's events on the jobs' execution in order to understand the system's reliability from the perspective of jobs and users. The characterization involves a joint analysis based on multiple data sources, including the reliability, availability, and serviceability (RAS) log; job scheduling log; the log regarding each job's physical execution tasks; and the I/O behavior log. We present 22 valuable takeaways based on our in-depth analysis. For instance, 99,245 job failures are reported in the job-scheduling log, a large majority (99.4%) of which are due to user behavior. The best-fitting distributions of a failed job's execution length (or interruption interval) include Weibull, Pareto, inverse Gaussian, and Erlang/exponential, depending on the types of errors (i.e., exit codes). The RAS events affecting job executions exhibit a high correlation with users and core-hours and have a strong locality feature. In terms of the failed jobs, our similarity-based event-filtering analysis indicates that the mean time to interruption is about 3.5 days.
In Checkpoint/Restart (C/R), finding the optimal checkpoint interval is important for reducing I/O workloads while maximizing the resiliency of application executions. Typically we find the optimal checkpoint interval using stochastic models. With the emergence of more complicated checkpointing strategies in HPC (multi-level checkpointing, complicated erasure encoding, etc.), modeling these approaches is becoming very difficult. Another approach is to rely on simulation techniques to find the optimal checkpoint interval. However, simulation is unacceptably time-consuming for practical use, where application developers would like to know the optimal interval at job submission. In this short talk, we introduce a checkpoint interval optimization technique using AI.
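The stochastic models mentioned here usually start from the first-order Young/Daly approximation, which for checkpoint cost C and mean time between failures M gives an optimal interval of roughly sqrt(2*C*M); a trivial sketch with made-up numbers:

```cpp
#include <cmath>
#include <cstdio>

// First-order Young/Daly approximation of the optimal checkpoint interval:
// T_opt ~= sqrt(2 * C * M), with C the checkpoint cost and M the mean time
// between failures (both in seconds). Multi-level and asynchronous schemes,
// as noted in the talk, are much harder to capture with such closed forms.
int main() {
    const double C = 600.0;        // example: 10-minute checkpoint
    const double M = 24.0 * 3600;  // example: 1-day MTBF
    const double t_opt = std::sqrt(2.0 * C * M);
    std::printf("optimal interval ~ %.0f s (~%.1f h)\n", t_opt, t_opt / 3600.0);
    return 0;
}
```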
What common checkpoint/restart approaches are in practice?
What resiliency strategies should be considered: multi-level C/R, asynchronous C/R, failure prediction, erasure encoding, or others?
Collaboration on C/R simulation framework development.
Title | Topic | Presenter | Authors | Abstract |
---|---|---|---|---|
Performance and Power Consumption Analysis of Arm Scalable Vector Extension with gem5 Simulator | Performance Tools | Tetsuya Odajima | Tetsuya Odajima, Yuetsu Kodama, Miwako Tsuji, Mitsuhisa Sato (RIKEN) | |
SLATE: Software for Linear Algebra Targeting Exascale | Numerical Methods | Jakub Kurzak | Jakub Kurzak, University of Tennessee; Mark Gates, University of Tennessee; Ali Charara, University of Tennessee; Asim YarKhan, University of Tennessee; Jack Dongarra, University of Tennessee |
BONSAI: Benchmarking OpeN Software Autotuning Infrastructure | Performance Tools | Jakub Kurzak | Jakub Kurzak, University of Tennessee; Mike Tsai, University of Tennessee; Mark Gates, University of Tennessee; Jack Dongarra, University of Tennessee |
PaRSEC - a data-flow task-based runtime | Programming Languages and Runtimes | Yu Pei, Qinglei Cao | Reazul Hoque, Yu Pei, Qinglei Cao - University of Tennessee, Knoxville | |
Extending Open MPI with Tool and Resilience Support | Parallel Programming models and runtime; Performance tools; Resilience | David Eberius, Dong Zhong | David Eberius, Dong Zhong (Innovative Computing Laboratory, University of Tennessee) | |
PRIONN: Predicting Runtime and IO using Neural Networks | I/O, Storage and In-Situ Processing | Michael Wyatt | Michael Wyatt (UTK), Stephen Herbein (LLNL), Todd Gamblin (LLNL), Adam Moody (LLNL), Dong H Ahn (LLNL), Michela Taufer (UTK) | |
Modeling Record-and-Replay for Nondeterministic Applications on Exascale Systems | Resilience, Performance tools, Applications and mini-apps | Dylan Chapp | Dylan Chapp (University of Delaware, University of Tennessee, Knoxville), Danny Rorabaugh (University of Tennessee, Knoxville), Michela Taufer (University of Tennessee, Knoxville) |
Performance improvements in Open MPI | Programming Languages and Runtimes | Xi Luo, Thananon Patinyasakdikul | Xi Luo, Thananon Patinyasakdikul (UTK) | |
Pseudo-Assembly Programming for Batched Matrix Factorization | scientific simulation, high-order PDE methods | Mike Tsai (UTK) | Mike Tsai (UTK); Piotr Luszczek (UTK); Jakub Kurzak (UTK); Jack Dongarra (UTK) |
Improve Exascale IO via an Adaptive Lossy Compressor | Big Data, I/O and in-situ visualization | Xin Liang | Xin Liang (UCR), Sheng Di (ANL), Sihuan Li (UCR), Dingwen Tao (UA), Bogdan Nicolae (ANL), Zizhong Chen (UCR), Franck Cappello (ANL) | |
Towards SDC Resilient Error Bounded Lossy Compressor | Resilience, ABFT (Algorithm Based Fault Tolerance), Lossy compression | Sihuan Li | Sihuan Li, UC Riverside; Sheng Di, ANL; Xin Liang, UC Riverside; Zizhong Chen, UC Riverside; Franck Cappello, ANL | |
Towards Unified Tasking on CPUs and GPUs | Parallel Programming Models | Laura Morgenstern | Laura Morgenstern (JSC), Ivo Kabadshow (JSC) | |
Holistic Measurement Driven System Assessment | Architecture, Resilience | William Kramer | PI: Bill Kramer (NCSA/UIUC); NCSA: Greg Bauer, Brett Bode, Jeremy Enos, Aaron Saxton, Mike Showerman; UIUC: Saurabh Jha (CS/CSL), Ravi Iyer (ECE/CSL), Zbigniew Kalbarczyk (ECE/CSL); SNL: Jim Brandt, Ann Gentile |
Modeling HPC applications for in situ Analytics | I/O, Storage and In-Situ Processing | Valentin HONORE | Guillaume AUPY, Brice GOGLIN, Valentin HONORE (Inria, LaBRI, Univ. Bordeaux, 33400 Talence, France) - Bruno RAFFIN (Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, 38000 Grenoble, France) | |
Enhancing Deep Learning towards Exascale with the DEEP-EST Modular Supercomputer Architecture | Parallel Programming models and runtime; Performance tools; | Rocco Sedona | Ernir Erlingsson, Gabriele Cavallaro, Morris Riedel, Rocco Sedona and Helmut Neukirchen (JSC)
We examine the effect of the vector length and the number of out-of-order resources on the performance and runtime power consumption of benchmarks by using the gem5 processor simulator and the McPAT framework. We evaluate the performance for multiple vector lengths using the gem5 processor simulator, which supports cycle-accurate simulation of an out-of-order pipeline. A long vector length improves peak computing performance because of an increase in the number of elements that can be executed in parallel, but it requires a wide die area. We also examine the power consumption for multiple vector lengths using the McPAT framework, which is a simulator for processor area and power consumption.
The objective of the Software for Linear Algebra Targeting Exascale (SLATE) project is to provide fundamental dense linear algebra capabilities to the US Department of Energy and to the high-performance computing (HPC) community at large, and ultimately to replace the venerable Scalable Linear Algebra PACKage (ScaLAPACK). SLATE is being developed from the ground up, with focus on scalability and support for hardware accelerators. This poster highlights SLATE's main design principles.
High-end computing relies increasingly on machines with large numbers of GPUs. At the same time, GPU kernels are commonly produced in the process of automated software tuning. The Benchtesting OpeN Software Autotuning Infrastructure (BONSAI) project provides a software infrastructure for deploying large performance tuning sweeps to supercomputers. BONSAI allows for parallel compilation and benchmarking of a large number of GPU kernels by dynamically scheduling work to a large number of distributed-memory nodes with multiple GPU accelerators. In this poster we outline BONSAI's design and highlight its main capabilities.
PaRSEC is a data-flow, architecture-aware, task-based runtime system supporting applications running on heterogeneous distributed machines, which has been shown to provide excellent performance to many scientific applications. A modular design allows PaRSEC to offer several programming interfaces and provide efficient support for Domain Specific Languages. In this poster we demonstrate one such high-level interface – Dynamic Task Discovery – and a more explicit low-level interface – Futures – and present recent work in task-based Data Redistribution, 2D Stencil, and Low Rank Cholesky, along with their performance results.
This poster details the implementation and usage of two extensions to Open MPI: Software-based Performance Counters (SPCs) and the Resilient PMIx Reference RunTime Environment (PRRTE). SPCs expose internal Open MPI performance information, which would otherwise be inaccessible, through the MPI_T interface. Resilient PRRTE provides an efficient runtime-level failure detection and propagation strategy targeting exascale systems. These extensions provide a more robust environment for Open MPI, with additional tool support through the MPI_T interface and fault tolerance through Resilient PRRTE, which supports ULFM.
For job allocation decisions, current batch schedulers have access to and use only information on the number of nodes and the runtime, because it is readily available at submission time from user job scripts. User-provided runtimes are typically inaccurate because users overestimate or lack understanding of job resource requirements. Beyond the number of nodes and runtime, other system resources, including I/O and network, are not available but play a key role in system performance. There is a need for automatic, general, and scalable tools that provide accurate resource usage information to schedulers so that, by becoming resource-aware, they can better manage system resources.
We tackle this need by presenting a tool for Predicting Runtime and IO using Neural Networks (PRIONN). PRIONN automates prediction of per-job runtime and IO resource usage, enabling IO-aware scheduling on HPC systems. The novelty of our tool is the input of whole job scripts into deep learning models that allows complete automation of runtime and IO resource predictions. We demonstrate the power of PRIONN with runtime and IO resource predictions applied to IO-aware scheduling for real HPC data. Specifically, we achieve over 75% mean and 98% median accuracy for runtime and IO predictions across 300,000 jobs from a real HPC machine. We combine our per-job runtime and IO predictions with queue and system simulations to predict future system IO usage accurately. We predict over 50% of IO bursts in advance on a real HPC system.
Record-and-replay (R&R) techniques present an attractive method for mitigating the harmful aspects of nondeterminism in HPC applications (e.g., numerical irreproducibility and hampered debugging), but are hamstrung by two problems. First, there is insufficient understanding of how the recording cost of existing R&R techniques responds to changes in application communication patterns, inputs, and other aspects of configuration, and to the degree of concurrency. Second, current R&R techniques have insufficient ability to exploit regularities in the communication patterns of individual applications.
We contend that it is crucial that the HPC community is equipped with modeling and simulation methodologies to assess the response of R&R tools, both in terms of execution time overhead and memory overhead, to changes in the configuration of the applications they monitor. We introduce a graph-based modeling methodology for quantifying the degree of nondeterminism in fixed applications across multiple runs and for comparing nondeterminism across multiple applications based on the varying graph structure of the modeled executions.
These models will enable scientists and application developers to make informed decisions about the scenarios in which R&R tools can be deployed cost-effectively, thus increasing their usability and utility in HPC as a whole. Moreover, this modeling effort will provide insights into the recording costs associated with the various communication patterns endemic to HPC applications, which in turn will enable the development of R&R tools that exploit these patterns to realize more compact representations of executions and thus reduced recording overhead.
HPC systems have seen tremendous improvements in computational power, advances unmatched by communication capabilities, leading to a substantial imbalance between computation and communication power. Thus, data movement has become a new bottleneck in some large-scale applications. With this in mind, it is essential for communication engines such as MPI to perform as efficiently as possible. This poster focuses on the study of, and various improvements to, multithreaded communication and collective operations in Open MPI, and how they perform on modern HPC systems.
While CUDA and OpenCL provide software infrastructure for hardware accelerator programming, the optimizations from their compilers are not always fully capable of giving the programmer an important level of control over the generated code. Sometimes this may lead to degraded performance. However, there are lower-level software layers, including intermediate representations (IR, AMD IL/HSAIL) or pseudo-assembly (LLVM, PTX), which could potentially give the programmer much better management of register allocation and memory affinity without disrupting optimal flow control. In this poster, we show the results of batched matrix factorization written in NVIDIA PTX for the Tesla line of GPU cards. The experimental data demonstrate the performance advantages of low-level programming in comparison with CUDA kernels.
Because of the ever-increasing data being produced by today’s large-scale scientific simulations, I/O performance is becoming a significant bottleneck for their executions. In this paper, we explore how to significantly improve the I/O performance for large-scale scientific simulations by leveraging an optimized lossy compressor. Our contribution is threefold. (1) We propose a compression-based I/O performance model and investigate the relationship between parallel I/O performance and lossy compression quality, by processing large-scale scientific simulation data on a supercomputer with various state-of-the-art lossy compressors. (2) We propose an adaptive, prediction-based lossy compression framework that can select the best-fit strategy from among our optimized prediction approaches in terms of the datasets, in that different prediction methods are particularly effective in different cases. (3) We evaluate the parallel I/O performance by using three large-scale simulations across different domains with up to 8,192 cores. Experiments show that our adaptive compressor can improve the I/O performance by about 100% compared with the second-best lossy compressor because of significantly reduced data size. In absolute terms, our compressor can improve the compression ratio by 112∼165% compared with the second-best lossy compressor. The total I/O time is reduced by up to 60X with our compressor compared with the original I/O time.
Lossy compression has been demonstrated to be very effective for extreme-scale scientific simulations in terms of improving I/O performance, reducing disk space usage, and so on. It has been applied to many scientific simulations in various science fields including molecular chemistry, fluid dynamics, climate, cosmology, and so on. Such an important application will silently impact the correctness of scientific simulation results if it is hit by soft errors. The existing fault tolerance techniques usually use redundancy at the cost of duplicated hardware or doubled execution time. In this work, we exploit the possibility of achieving resiliency against SDC for one of the most widely used lossy compressors, SZ, from the perspective of software and algorithms. Preliminary results have shown that our proposed solution can increase the SDC resilience of SZ significantly with negligible overheads at large scale.
Task parallelism is omnipresent these days. Despite the success of task parallelism on CPUs, there is currently no performant way to exploit the task parallelism of synchronization-critical algorithms on GPUs. Due to this shortcoming, we develop a tasking approach for GPU architectures. Our use case is a fast multipole method for molecular dynamics (MD) simulations. Since the problem size in MD is typically small, we have to target strong scaling. Hence, the application tends to be latency- and synchronization-critical. Therefore, offloading as the classical programming model for GPUs is unfeasible. The poster highlights our experience with the design and implementation of tasking as alternative programming model for GPUs using CUDA. We describe the tasking approach for GPUs based on the design of our tasking approach for CPUs. Following this, we reveal several pitfalls implementing it. Among others, we consider warp-synchronous deadlocks, weak memory consistency and hierarchical multi-producer multi-consumer queues. Finally, we provide first performance results of a prototypic implementation.
The Holistic Measurement Driven System Assessment (HMDSA) project is designed to enable maximum science production for large-scale high performance computing (HPC) facilities, independent of major component vendor, and within budget constraints of money, space, and power. We accomplish this through development and deployment of scalable, platform-independent, open-source tools and techniques for monitoring, coupled with runtime analysis and feedback, which enables highly efficient HPC system operation and usage and also informs future system improvements.
We take a holistic approach through:
- Monitoring of all performance impacting information (e.g., atmospheric conditions, physical plant, HPC system components, application resource utilization) with flexible frameworks for variable fidelity data collection to minimize application and/or system overhead,
- Developing scalable storage, retrieval, and run-time analytics to provide identification of performance impacting behaviors,
- Developing feedback and problem (e.g., faults, resource depletion, contention) mitigation strategies and mechanisms targeting applications, system software, hardware, and users.
With the goal of performing exascale computing, the importance of I/O management becomes more and more critical to maintain system performance. While the computing capacities of machines are getting higher, the I/O capabilities of systems do not increase as fast. We are able to generate more data but unable to manage them efficiently due to the variability of I/O performance. Limiting the requests to the Parallel File System (PFS) becomes necessary. To address this issue, new strategies are being developed, such as online in situ analysis. The idea is to overcome the limitations of basic post-mortem data analysis, where the data have to be stored on the PFS first and processed later. There are several software solutions that allow users to specifically dedicate nodes to the analysis of data and distribute the computation tasks over different sets of nodes. Thus far, they rely on manual partitioning and allocation of resources to tasks (simulations, analysis) by the user.
In this work, we propose a memory-constrained model for in situ analysis. We use this model to provide different scheduling policies that determine the number of resources that should be dedicated to analysis functions and schedule these functions efficiently. We evaluate them and show the importance of considering memory constraints in the model.
The Dynamical Exascale Entry Platform – Extreme Scale Technologies (DEEP-EST) project aims at delivering a pre-exascale platform based on a Modular Supercomputer Architecture (MSA) which provides, alongside a standard CPU Cluster Module, a many-core Extreme Scale Booster (ESB), a Global Collective Engine (GCE) to speed up MPI collective operations in hardware, Network Attached Memory (NAM) as a fast scratch-file replacement, and a hardware-accelerated Data Analytics Module (DAM); the latter are able to perform near-data processing.
Name | Affiliation | Arrival | Departure |
---|---|---|---|
Andreas Beckmann | Jülich Supercomputing Centre | 4/14/2019 12:00:00 | 4/18/2019 11:59:00 |
Andreas Lintermann | Jülich Supercomputing Centre | 4/14/2019 20:00:00 | 4/17/2019 13:45:00 |
Anne Nikodemus | Jülich Supercomputing Centre | 4/14/2019 19:00:00 | 4/16/2019 17:00:00 |
Bernd Mohr | Jülich Supercomputing Centre | 4/14/2019 20:00:00 | 4/18/2019 10:00:00 |
Brian Wylie | Jülich Supercomputing Centre | 4/7/2019 21:00:00 | 4/18/2019 9:00:00 |
Daniel Coquelin | Jülich Supercomputing Centre | 4/13/2019 18:00:00 | 4/17/2019 5:00:00 |
Ivo Kabadshow | Jülich Supercomputing Centre | 4/14/2019 1:00:00 | 4/18/2019 0:00:00 |
Laura Morgenstern | Jülich Supercomputing Centre | 4/14/2019 13:00:00 | 4/18/2019 12:00:00 |
Mirco Altenbernd | Jülich Supercomputing Centre | 4/14/2019 18:00:00 | 4/17/2019 18:00:00 |
Norbert Attig | Jülich Supercomputing Centre | 4/13/2019 20:00:00 | 4/17/2019 13:00:00 |
Ruth Schoebel | Jülich Supercomputing Centre | 4/13/2019 22:00:00 | 4/18/2019 5:00:00 |
Robert Speck | Jülich Supercomputing Centre | 4/13/2019 20:00:00 | 4/17/2019 13:00:00 |
Rocco Sedona | Jülich Supercomputing Centre | 4/14/2019 19:00:00 | 4/18/2019 15:45:00 |
Thomas Lippert | Jülich Supercomputing Centre | 4/14/2019 23:45:00 | 4/18/2019 14:50:00 |
Wolfgang Frings | Jülich Supercomputing Centre | 4/14/2019 19:00:00 | 4/17/2019 15:00:00 |
Anthony Danalis | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Ana Gainaru | The University of Tennessee | 4/14/2019 2:00:00 | 4/17/2019 22:00:00 |
George Bosilca | The University of Tennessee | 4/14/2019 9:00:00 | 4/17/2019 18:00:00 |
Aurelien Bouteiller | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Ali Charara | The University of Tennessee | 4/15/2019 8:00:00 | 4/17/2019 17:00:00 |
Dylan Chapp | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00
David Eberius | The University of Tennessee | 4/15/2019 8:00:00 | 4/17/2019 16:00:00 |
Damien Genet | The University of Tennessee | 4/14/2019 18:00:00 | 4/18/2019 11:00:00 |
Jack Dongarra | The University of Tennessee | 4/15/2019 8:00:00 | 4/17/2019 15:00:00 |
Danny Rorabaugh | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Dong Zhong | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Hartwig Anzt | The University of Tennessee | 4/15/2019 8:00:00 | 4/17/2019 15:00:00 |
Thomas Herault | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Ichitaro Yamazaki | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Jiali Li | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Jon Calhoun | The University of Tennessee | 4/14/2019 18:00:00 | 4/17/2019 17:00:00 |
Joseph Teague | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Jakub Kurzak | The University of Tennessee | 4/15/2019 8:00:00 | 4/17/2019 17:00:00 |
Piotr Luszczek | The University of Tennessee | 4/15/2019 0:01:00 | 4/18/2019 23:59:00 |
Michael Wyatt | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Nuria Losada Lopez Valcarcel | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Qinglei Cao | The University of Tennessee | 4/15/2019 8:30:00 | 4/17/2019 15:00:00 |
Gerald Ragghianti | The University of Tennessee | 4/14/2019 1:00:00 | 4/18/2019 1:00:00 |
Reazul Hoque | The University of Tennessee | 4/15/2019 2:00:00 | 4/17/2019 10:00:00 |
Ryan Marshall | The University of Tennessee | 4/14/2019 17:30:00 | 4/17/2019 15:30:00 |
Stephen Thomas | The University of Tennessee | 4/15/2019 9:00:00 | 4/15/2019 15:00:00 |
Michela Taufer | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Terry Moore | The University of Tennessee | 4/15/2019 8:00:00 | 4/17/2019 15:00:00 |
Stanimire Tomov | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Anthony Skjellum | The University of Tennessee | 4/15/2019 18:00:00 | 4/17/2019 15:00:00 |
Thananon Patinyasakdikul | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Xi Luo | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Asim YarKhan | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 23:59:00
Yu Pei | The University of Tennessee | 4/15/2019 8:00:00 | 4/17/2019 15:00:00 |
Yaohung Tsai | The University of Tennessee | 4/15/2019 9:00:00 | 4/17/2019 15:00:00 |
Azzam Haidar | The University of Tennessee | 4/15/2019 08:00:00 | 4/17/2019 15:00:00 |
Atsushi Hori | RIKEN R-CCS | 4/14/2019 16:00:00 | 4/18/2019 6:00:00 |
Toshiyuki Imamura | RIKEN R-CCS | 4/14/2019 21:00:00 | 4/18/2019 0:00:00 |
Jinpil Lee | RIKEN R-CCS | 4/14/2019 12:00:00 | 4/18/2019 10:00:00 |
Kentaro Sano | RIKEN R-CCS | 4/14/2019 15:00:00 | 4/17/2019 16:10:00 |
Kento Sato | RIKEN R-CCS | 4/15/2019 20:35:00 | 4/18/2019 8:15:00 |
Masaaki Kondo | RIKEN R-CCS | 4/14/2019 15:00:00 | 4/16/2019 15:00:00 |
Satoshi Matsuoka | RIKEN R-CCS | 4/15/2019 20:35:00 | 4/17/2019 13:10:00 |
Miwako Tsuji | RIKEN R-CCS | 4/15/2019 15:00:00 | 4/17/2019 15:00:00 |
Mitsuhisa Sato | RIKEN R-CCS | 4/15/2019 17:00:00 | 4/17/2019 15:00:00 |
Takahiro Ogura | RIKEN R-CCS | 4/14/2019 14:20:00 | 4/18/2019 8:15:00 |
Tetsuya Odajima | RIKEN R-CCS | 4/14/2019 20:35:00 | 4/17/2019 18:40:00 |
Yuetsu Kodama | RIKEN R-CCS | 4/15/2019 21:30:00 | 4/17/2019 15:00:00 |
Brendan McGinty | The University of Illinois | 4/15/2019 07:30:00 | 4/16/2019 15:00:00 |
Brett Bode | The University of Illinois | 4/14/2019 17:00:00 | 4/18/2019 9:00:00 |
Daniel Katz | The University of Illinois | 4/14/2019 22:00:00 | 4/18/2019 12:00:00 |
Colleen Heinemann | The University of Illinois | 4/14/2019 17:00:00 | 4/18/2019 9:00:00 |
Iftekhar Ahmed | The University of Illinois | 4/14/2019 12:00:00 | 4/17/2019 11:00:00 |
Laxmikant Kale | The University of Illinois | 4/16/2019 12:00:00 | 4/17/2019 14:00:00 |
Alina Kononov | The University of Illinois | 4/14/2019 15:00:00 | 4/18/2019 10:00:00 |
Scott Poole | The University of Illinois | 4/15/2019 17:00:00 | 4/17/2019 15:00:00 |
Mike Showerman | The University of Illinois | 4/14/2019 16:00:00 | 4/18/2019 12:00:00 |
Omri Mor | The University of Illinois | 4/15/2019 16:00:00 | 4/17/2019 17:00:00 |
Raghavendra Kanakagiri | The University of Illinois | 4/16/2019 23:00:00 | 4/17/2019 15:00:00 |
Aaron Saxton | The University of Illinois | 4/14/2019 12:00:00 | 4/18/2019 12:00:00 |
Robert Sisneros | The University of Illinois | 4/14/2019 16:00:00 | 4/17/2019 16:00:00 |
Shelby Lockhart | The University of Illinois | 4/14/2019 0:00:00 | 4/18/2019 0:00:00 |
Marc Snir | The University of Illinois | 4/14/2019 15:00:00 | 4/16/2019 18:00:00 |
William Gropp | The University of Illinois | 4/14/2019 18:00:00 | 4/17/2019 8:00:00 |
William Kramer | The University of Illinois | 4/14/2019 20:00:00 | 4/17/2019 17:00:00 |
Xiao Zhang | The University of Illinois | 4/14/2019 17:00:00 | 4/17/2019 10:00:00 |
Vincent Reverdy | The University of Illinois | 4/15/2019 17:00:00 | 4/17/2019 15:00:00 |
Bogdan Nicolae | Argonne National Laboratory | 4/13/2019 20:35:00 | 4/17/2019 17:20:00 |
Franck Cappello | Argonne National Laboratory | 4/15/2019 17:00:00 | 4/18/2019 10:00:00 |
Kazutomo Yoshii | Argonne National Laboratory | 4/16/2019 11:00:00 | 4/18/2019 10:00:00
Nicolas Denoyelle | Argonne National Laboratory | 4/16/2019 11:00:00 | 4/17/2019 17:00:00 |
Orçun Yildiz | Argonne National Laboratory | 4/15/2019 17:00:00 | 4/18/2019 9:00:00 |
Pierre Matri | Argonne National Laboratory | 4/14/2019 18:00:00 | 4/18/2019 8:00:00 |
Romit Maulik | Argonne National Laboratory | 4/15/2019 10:00:00 | 4/17/2019 13:00:00 |
Sheng Di | Argonne National Laboratory | 4/16/2019 11:00:00 | 4/17/2019 17:00:00 |
Sihuan Li | Argonne National Laboratory | 4/16/2019 10:00:00 | 4/17/2019 15:30:00 |
Swann Perarnau | Argonne National Laboratory | 4/15/2019 23:00:00 | 4/18/2019 9:00:00 |
Valentin Reis | Argonne National Laboratory | 4/15/2019 23:50:00 | 4/18/2019 3:00:00 |
Valerie Taylor | Argonne National Laboratory | 4/15/2019 23:00:00 | 4/16/2019 8:00:00 |
Justin Wozniak | Argonne National Laboratory | 4/14/2019 18:00:00 | 4/17/2019 15:00:00 |
Xin Liang | Argonne National Laboratory | 4/16/2019 10:00:00 | 4/17/2019 15:30:00 |
Zheming Jin | Argonne National Laboratory | 4/15/2019 11:00:00 | 4/18/2019 11:59:00 |
Christian Perez | INRIA | 4/14/2019 20:00:00 | 4/17/2019 17:00:00 |
Christine Morin | INRIA | 4/14/2019 22:00:00 | 4/18/2019 12:00:00 |
Eric Rutten | INRIA | 4/15/2019 23:45:00 | 4/17/2019 15:00:00 |
Gabriel Antoniu | INRIA | 4/14/2019 20:00:00 | 4/17/2019 17:00:00 |
Hongyang Sun | INRIA | 4/13/2019 17:00:00 | 4/18/2019 11:00:00 |
Julien Herrmann | INRIA | 4/14/2019 17:00:00 | 4/17/2019 18:00:00 |
Nathanaël Cheriere | INRIA | 4/13/2019 22:00:00 | 4/18/2019 15:00:00 |
Nick Schenkels | INRIA | 4/14/2019 20:00:00 | 4/18/2019 10:00:00 |
Pedro Paulo de Souza Bento da Silva | INRIA | 4/13/2019 18:44:00 | 4/17/2019 17:00:00 |
Valentin Honoré | INRIA | 4/13/2019 19:00:00 | 4/17/2019 17:00:00 |
Valentin Le Fèvre | INRIA | 4/13/2019 21:00:00 | 4/18/2019 14:00:00 |
Yves Robert | INRIA | 4/13/2019 15:00:00 | 4/18/2019 10:00:00 |
Constantino Gómez | Barcelona Supercomputing Center | 4/14/2019 15:00:00 | 4/17/2019 17:00:00 |
Germán Llort | Barcelona Supercomputing Center | 4/13/2019 9:00:00 | 4/18/2019 15:00:00 |
Judit Gimenez | Barcelona Supercomputing Center | 4/13/2019 0:00:00 | 4/17/2019 12:00:00
Leonardo Bautista Gomez | Barcelona Supercomputing Center | 4/15/2019 13:00:00 | 4/17/2019 13:00:00 |
Marc Jorda | Barcelona Supercomputing Center | 4/14/2019 16:00:00 | 4/18/2019 12:00:00 |
Jesus Labarta | Barcelona Supercomputing Center | 4/14/2019 19:47:00 | 4/17/2019 15:25:00 |
Ricard Borrell Pol | Barcelona Supercomputing Center | 4/14/2019 0:00:00 | 4/17/2019 19:00:00 |
Rosa Badia | Barcelona Supercomputing Center | 4/15/2019 18:30:00 | 4/18/2019 13:00:00 |
Sandra Catalán | Barcelona Supercomputing Center | 4/13/2019 17:00:00 | 4/19/2019 12:00:00 |
Vicenc Beltran | Barcelona Supercomputing Center | 4/14/2019 17:00:00 | 4/17/2019 19:00:00 |
Mark Moran | John Deere | 4/15/2019 | 4/16/2019 15:00:00 |