The workshop gathers leading researchers in high-performance computing from the JLESC partners INRIA, the University of Illinois, Argonne National Laboratory, Barcelona Supercomputing Center, Jülich Supercomputing Centre, RIKEN R-CCS and The University of Tennessee to explore the most recent and critical issues in advancing the field of HPC from petascale to the extreme scale era.
The workshop will feature sessions on seven central topics.
In addition, dedicated sessions on computational fluid dynamics, computational biology and climate/weather research are planned.
A key objective of the workshop is to identify new research collaborations and establish a roadmap for their implementation.
The workshop is open to Illinois, INRIA, ANL, BSC, JSC, RIKEN R-CCS and UTK faculty, researchers, engineers and students who want to learn more about Post-Petascale / Pre-Exascale Computing.
Day 1

| Time (ET) | Track 1 | Track 2 |
| --- | --- | --- |
| 08:00 | Opening (Franck Cappello, Robert Speck) | |
| 08:10 | ST M1.1 (6) AI. Session chair: George Bosilca | Multiprecision Numerics for HPC |
| 09:40 | ST M1.2 (6) I/O, In-situ. Session chair: Ruth Schöbel | |
| 11:10 | Open zoom sessions (all zoom sessions will remain open until 1PM ET) | |
Day 2

| Time (ET) | Track 1 | Track 2 |
| --- | --- | --- |
| 08:00 | ST M2.1 (6) Numerical Methods and Resilience. Session chair: Kevin Sala | Challenges and opportunities with running AI workloads on HPC systems |
| 09:30 | ST M2.2 (6) Performance Tools. Session chair: Daichi Mukunoki | |
| 11:00 | Open zoom sessions (all zoom sessions will remain open until 1PM ET) | |
Day 3

| Time (ET) | Track 1 | Track 2 |
| --- | --- | --- |
| 08:00 | ST M3.1 (6) Programming Languages and Runtimes. Session chair: Daniel Barry | Heterogeneous and reconfigurable architectures for the future of computing |
| 09:30 | ST M3.2 (6) Programming Languages and Runtimes and Advanced Architectures. Session chair: Colleen Heinemann | |
| 11:00 | Closing (Franck Cappello, Robert Speck) | |
| 11:10 | Open zoom sessions (all zoom sessions will remain open until 1PM ET) | |
As time progresses, the volume of data surrounding machine learning and AI methods continues to grow, from training, testing, and validation datasets to the models themselves. As the volume grows, there are increasing challenges in transporting and storing the data. Lossy compression techniques present the opportunity to drastically reduce the volume of data while maintaining or even improving upon the quality of the decisions made by AI. This talk presents a survey of novel research examining the effect of lossy compression on AI decision making in a variety of domains, including medical data science and physics. We examine the effects of data ordering, error bounds, and compression methodologies, and make general recommendations from these and other areas regarding how to effectively leverage lossy compression in AI using the common LibPressio interface.
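As a rough illustration of the kind of study described above, the sketch below sweeps an absolute error bound and measures how classification accuracy responds. The `lossy_roundtrip` helper is a hypothetical stand-in for a real LibPressio compress/decompress call, and the dataset and classifier are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def lossy_roundtrip(x, abs_error_bound):
    """Placeholder for a LibPressio compress/decompress round trip:
    here we simply inject bounded uniform noise to mimic an absolute error bound."""
    noise = np.random.uniform(-abs_error_bound, abs_error_bound, size=x.shape)
    return x + noise

X, y = np.random.rand(2000, 32), np.random.randint(0, 2, 2000)  # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
for bound in [1e-4, 1e-3, 1e-2, 1e-1]:
    acc = RandomForestClassifier(random_state=0).fit(lossy_roundtrip(X_tr, bound), y_tr).score(X_te, y_te)
    print(f"error bound {bound:g}: accuracy {acc:.3f} (baseline {baseline:.3f})")
```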
TBD
CyberInfrastructure (CI) is increasingly harnessed to support the use of artificial intelligence (AI) techniques and applications to enable smart cities, power grids, computational simulations, real-time processing in big-data experiments (LIGO), supercomputing, and other important innovations. Computation is essential to developing and using the requisite analytics, while CI networks and data facilities are central to moving and managing the immense volumes of data. One might refer to this as CI for AI. Just as AI offers promise to revolutionize all aspects of our lives, the potential for improving the design, evolution, optimization, performance, operation, security and sustainability of CI is equally promising. We refer to AI applied to challenges of improving CyberInfrastructure as AI for CI.
A convergent research approach is required to explore the application of AI for CI, comprising experts from across the spectrum of cyberinfrastructure research, design, and deployment; computer and computational sciences; and systems design and engineering. To this end, we propose to create an AI for CI research activity to drive transformative exploration of CI using AI methods.
What AI methods are best for what purposes? What training is available? What common research challenges can be identified across the exemplar CI families above? They all share some common themes: 1) the systems are so complex that only domain experts can make reasonable and repeatable optimization decisions, and the complexity is growing exponentially; 2) decision making should be performed with minimal lag time for optimum performance, but the consequences of those decisions, in terms of human welfare and/or monetary cost, can be high (e.g., loss of life and/or significant money); 3) processing telemetry data is a “big data” problem in itself, with requirements for low latency to solution in the presence of possibly incomplete data; 4) individual data values are bounded, but the high dimensionality and variability in aggregate make full ensemble processing, within effective latency bounds, currently impossible; 5) social aspects of trusting the judgment of a machine must be overcome for significant deployment; 6) workload and resource scheduling, swarm coordination/management, resiliency, energy efficiency, ease of use by developers and end users, and explainability are all open concerns; and 7) labeled datasets are lacking, and generating labeled datasets for training is costly.
We will describe how the CANDLE/Supervisor system is used for configuration tuning using hyperparameter search and optimization. We will also illustrate its use in the analysis of the underlying training data, including noise analysis and the search for data that is important to the application results. Performance and concurrency results will also be shown from running the system for cancer benchmark applications on the Summit system.
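The core loop behind such hyperparameter tuning can be pictured as follows. This is a generic random-search sketch, not CANDLE/Supervisor's actual workflow or API, and `run_benchmark` is a hypothetical stand-in for launching one training job.

```python
import random

# hypothetical search space for a training run; names are illustrative only
space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [32, 64, 128],
    "dropout": [0.0, 0.2, 0.5],
}

def run_benchmark(config):
    """Stand-in for launching one training job (e.g., a cancer benchmark)
    and returning its validation loss."""
    return random.random()  # replace with the real objective

best = None
for _ in range(20):                      # budget of 20 evaluations
    cfg = {k: random.choice(v) for k, v in space.items()}
    loss = run_benchmark(cfg)
    if best is None or loss < best[0]:
        best = (loss, cfg)
print("best configuration:", best)
```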
Multiple questions relevant to the Workflows and I/O projects will be raised including: How to manage storage for data pre-processing for model training; how to schedule various tasks in the training workflow; how to manage logical "working sets" of input and output data across runs efficiently.
A protein's structure determines its function. Different proteins have different structures and proteins in the same family share similar substructures and thus may share similar functions. Additionally, one protein may exhibit several structures (or conformations). Classifying different protein conformations can help solve problems such as determining the cause of diseases and designing drugs. X-ray Free Electron Laser (XFEL) beams are used to create diffraction patterns (images) that can reveal protein structure (and function).
The translation between diffraction patterns in the XFEL images and protein structures is non-trivial. In this talk we present the first steps into the design and assessment of a software framework for the classification of XFEL images. We call the framework XPC (XFEL-based Protein Classifier). We quantify the classification accuracy and performance of XPC for protein diffraction imaging datasets including realistic noise, different protein orientations and conformations. This project is collaborative research between UTK and RIKEN.
What is the suitability of the current framework for real-world cases (e.g., 3-D reconstruction of proteins in general and of the ribosome in particular, annotation of images for their classification without human intervention)?
What is the range of errors (i.e., distribution of orientation and/or conformation) tolerable for the scientists?
What are the costs for training and predictions?
Can we use systematic data from simulations to train the framework and use the trained models for classification/prediction of real experiment data successfully?
How can we augment the framework to also classify / predict protein structures (currently the framework captures orientations and conformations)?
This work introduces the notion of state preservation, which generalizes checkpointing to productive use cases. It focuses in particular on deep learning applications, which exhibit significantly different behavior at large scale compared with traditional HPC applications. This has important consequences for state preservation, which will be discussed in the talk, concluding with a series of early results.
How can deep learning applications use model checkpoints and the dependencies between them to build better training strategies? How can caching of training data reduce sampling overhead? How can nodes collaborate to build distributed checkpoints without involving the parallel file system?
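For context, a plain synchronous checkpoint of a PyTorch training loop looks roughly like the sketch below (a minimal illustration, not the VeloC API); the state-preservation work discussed above aims to make this step far less intrusive, e.g. asynchronous and multi-level.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                       # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
CKPT_EVERY = 100                                 # iterations between checkpoints

for step in range(1000):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

    if step % CKPT_EVERY == 0:
        # capture model + optimizer + iteration number so training can resume exactly
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "opt": opt.state_dict()},
                   f"ckpt_{step}.pt")
```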
Workflow systems promise scientists the ability to accelerate the scientific process by coordinating multiple computing tasks. Optimizing and controlling workflow are vital to this process, and failing to do so can significantly delay scientific discoveries and waste precious resources. Scientists often configure workflows manually with trial and error based on their expertise. In this talk, we will discuss how AI techniques can be used to automatically trigger changes in the workflow with the goal of improving performance. We hope that this talk will set up potential collaboration(s) towards this goal.
How can AI techniques be used to automatically trigger changes in the workflow with the goal of improving performance and resource utilization? We hope that our talk will set up potential collaboration(s) towards this goal.
Earth system models (ESMs) have increased spatial resolution to achieve more accurate solutions. As a consequence, the number of grid points increases dramatically, so an enormous amount of data is produced as simulation results. However, some ESMs use inefficient sequential I/O schemes that do not scale well when many parallel resources are used. This issue is typically addressed by adopting scalable parallel I/O solutions.
This work analyzes and improves the I/O process of the Integrated Forecasting System (IFS), one of the most important atmospheric models used in Europe. IFS can use two different output schemes: the MF I/O server and an inefficient sequential I/O scheme. The latter is the only scheme that can be used by non-ECMWF users. Additionally, these two output schemes originally included in IFS produce the outputs in GRIB format, which is not the standard used by the climate community.
Therefore, we present an easy-to-use development that integrates an asynchronous parallel I/O server called the XML I/O Server (XIOS) into IFS. XIOS offers a series of features that are especially targeted at climate modelling: netCDF format data, online diagnostics and CMORized data.
Moreover, an analysis is done to evaluate the computational performance of the IFS-XIOS integration, proving that the integration itself is adequate in terms of computational efficiency, but that XIOS performance is highly dependent on the output size of the netCDF files. Therefore, HDF5 compression is tested to try to reduce the size of the data so that both I/O time and storage footprint can be improved.
However, the default lossless compression filter of HDF5 does not provide a good trade-off between size reduction and computational cost. Thus, it is necessary to explore lossy compression filters that may allow reaching high compression ratios with enough compression speed to considerably reduce the I/O time while keeping high accuracy.
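As a small illustration of the two options, the h5py sketch below writes the same field once with the built-in lossless deflate filter and once with a registered third-party filter selected by numeric ID. The ID 32017 is assumed here to be the SZ filter registration; the actual ID and filter options should be checked against the HDF5 filter registry and the installed plugin.

```python
import numpy as np
import h5py

field = np.random.rand(256, 512, 1024).astype(np.float32)   # stand-in 3-D model field

with h5py.File("output.h5", "w") as f:
    # built-in lossless deflate (gzip): safe, but often a poor size/speed trade-off
    f.create_dataset("temp_gzip", data=field, chunks=True,
                     compression="gzip", compression_opts=4)

    # a registered third-party lossy filter can be selected by its numeric filter ID,
    # provided the filter plugin is installed; 32017 is assumed to be the SZ filter
    try:
        f.create_dataset("temp_sz", data=field, chunks=True, compression=32017)
    except (OSError, ValueError):
        print("lossy filter plugin not available on this system")
```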
What types of lossy compression are most suitable for weather forecast and climate prediction? Can we use the same compression for all variables?
In particular, we are interested in the SZ compressor from ANL. Do you think it is suitable for our needs? Is it compatible with parallel I/O?
Thus, might there be a potential collaboration between ANL and BSC?
The SZ compressor is already registered as a third-party filter of HDF5. We would like to explore with ANL whether this is enough for XIOS or whether we would need to develop a particular solution.
On HPC platforms, concurrent applications share the same file system. This can lead to conflicts, especially as applications become more and more data intensive, and I/O contention can become a performance bottleneck. Access to bandwidth can be split into two complementary yet distinct problems: the mapping problem and the scheduling problem. The mapping problem consists in selecting the set of applications that compete for the I/O resource. The scheduling problem then consists, given I/O requests on the same resource, in determining the order of those accesses so as to minimize concurrency and the I/O time. The first part of our work takes into account the iterative behavior of HPC applications to design a model and study list-scheduling heuristics for the scheduling problem. In the second part, we design a data-aware partitioning of the platform to address the mapping problem. To build the solution of this problem we use a continuous model, built to represent a steady state, in order to balance the application bandwidth over the whole machine.
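As a toy illustration of the scheduling side (not the model or heuristics studied in this work), the sketch below grants exclusive access to the I/O resource in a greedy, list-scheduling order with a shortest-phase-first priority rule; the applications and durations are made up.

```python
# Toy list-scheduling of exclusive I/O phases on one shared resource.
# Each application is (name, release_time, io_duration).
apps = [("A", 0.0, 3.0), ("B", 1.0, 1.0), ("C", 1.5, 2.0)]

time, schedule, pending = 0.0, [], sorted(apps, key=lambda a: a[1])
while pending:
    ready = [a for a in pending if a[1] <= time] or [pending[0]]
    # priority rule: shortest I/O phase first among the ready applications
    job = min(ready, key=lambda a: a[2])
    start = max(time, job[1])
    schedule.append((job[0], start, start + job[2]))
    time = start + job[2]
    pending.remove(job)

for name, start, end in schedule:
    print(f"{name}: I/O from t={start} to t={end}")
```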
Future work may involve extending the model, discussing the sensitivity of the policies, and characterizing on which workloads our policy works best.
We may also try to move from a model with exclusive access to bandwidth to one with resource sharing.
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging. This breaks down to reconciling many, typically contradicting application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum. In this paper we introduce a rigorous methodology for such a process and validate it through E2Clab. It is the first platform to support the complete analysis cycle of an application on the Computing Continuum: (i) the configuration of the experimental environment, libraries and frameworks; (ii) the mapping between the application parts and machines on the Edge, Fog and Cloud; (iii) the deployment of the application on the infrastructure; (iv) the automated execution; and (v) the gathering of experiment metrics. We illustrate its usage with a real-life application deployed on the Grid'5000 testbed, showing that our framework allows one to understand and improve performance, by correlating it to the parameter settings, the resource usage and the specifics of the underlying infrastructure. This work will be presented at IEEE Cluster 2020.
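To make step (ii) more tangible, the snippet below sketches the kind of experiment description E2Clab works from: which application parts run on which layer, what network conditions are emulated between layers, and which metrics are gathered. This is a hypothetical Python description, not E2Clab's actual configuration format; the Grid'5000 site names and numbers are purely illustrative.

```python
# Hypothetical Edge-to-Cloud experiment description (illustration only,
# NOT E2Clab's actual configuration format).
experiment = {
    "layers": {
        "edge":  {"machines": 16, "site": "grid5000:lyon",   "roles": ["sensor_producer"]},
        "fog":   {"machines": 4,  "site": "grid5000:rennes", "roles": ["stream_aggregator"]},
        "cloud": {"machines": 8,  "site": "grid5000:nancy",  "roles": ["analytics_backend"]},
    },
    "network_emulation": {"edge-fog":  {"latency_ms": 20, "bw_mbits": 100},
                          "fog-cloud": {"latency_ms": 5,  "bw_mbits": 10000}},
    "metrics": ["end_to_end_latency", "throughput", "resource_usage"],
    "repetitions": 5,
}
```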
Future work and collaboration may involve:
(i) Enable built-in support for other large-scale experimental testbeds (besides Grid’5000) such as Chameleon and more.
(ii) Leverage E2Clab to build a benchmark suite for applications running on the Computing Continuum.
(iii) Help E2Clab evolve towards an emulation framework for applications in the Edge-to-Cloud Computing Continuum. After reviewing the literature, we observed that existing solutions (emulation and simulation tools) focus just on the Cloud, Fog, or Edge. E2Clab could be the first one to support the Edge-to-Cloud Continuum. Currently, E2Clab supports network emulation, and it could be extended to emulate common behaviors of applications in the Edge-to-Cloud Continuum, such as resource constrained devices, failures (network, services, etc.), and mobility.
(iv) Deploy more real-life use cases using E2Clab to perform large-scale experiments; and extend E2Clab's features (if needed) to support the new use case. Do you have any use cases to share?
Large-scale applications and workflows in scientific domains such as weather forecast or cosmology have increasing I/O needs. However, storage has been designed for decades as a global shared resource, making I/O one of the main bottlenecks of HPC systems. To mitigate this congestion, new tiers (node-local SSDs, burst buffers, network-attached storage, and so on) have been added to recently deployed supercomputers, increasing their complexity. An open question is how to provision those resources to jobs. Indeed, it is common for HPC systems to provide dynamic and exclusive access to compute nodes through a batch scheduler, but little has been done for dynamically provisioning storage resources. In this talk, we present our preliminary work on scheduling techniques for storage resources.
- Scheduling algorithm minimizing I/O interference between 'storage jobs'
- We are also seeking access to raw network-attached storage (NVMeoF). Does anybody have that?
In this work, we are interested in processing "2nd generation" applications on HPC facilities.
The main property of those applications is a huge variability in terms of both execution time and memory usage.
For example, such applications can be found in neuroscience and ML. They can take a (deterministic) walltime of a few hours to a few days, depending on the input.
This variability becomes an issue when those applications are submitted to HPC platforms, which require a good estimate of the application walltime. Similarly, the variability in memory consumption is a limitation when using cloud facilities.
To execute such applications on HPC platforms, the usual way is to "guess" a reservation time based on logs of previous runs.
However, due to the variability in walltime, strategies such as requesting the maximum walltime observed so far, or periodic strategies, are inefficient and costly.
They either overestimate the walltime, leading to resource wastage, or underestimate it, leading to wasted computation (the job is killed before terminating).
In our work, we propose to model this decision process and provide an optimal "sequence of reservations" given a probability distribution of the job's execution time.
We also extend this approach by including resilience in some well-chosen reservations.
We validate the performance of our new scheduling strategies using both simulation and real experiments on neuroscience applications developed at Vanderbilt University.
In this talk, we will present our approach to this problem, as well as some results and future work.
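To make the reservation-sequence model concrete, the sketch below evaluates the expected cost of a candidate sequence under a discrete walltime distribution, assuming (as a simplification) that every attempted reservation is paid in full until one is long enough for the job to finish; the actual cost model and the construction of the optimal sequence in this work may differ.

```python
# Expected cost of a sequence of reservations for a job whose walltime
# follows a discrete distribution. Simplified model: each attempted
# reservation is paid in full until one is long enough for the job to finish.
walltimes = [2.0, 4.0, 8.0, 16.0]        # possible walltimes (hours)
probs     = [0.4, 0.3, 0.2, 0.1]         # their probabilities

def expected_cost(reservations):
    cost = 0.0
    for w, p in zip(walltimes, probs):
        paid = 0.0
        for t in reservations:
            paid += t
            if t >= w:                   # job finishes within this reservation
                break
        cost += p * paid
    return cost

print(expected_cost([16.0]))                 # always reserve the maximum: 16.0
print(expected_cost([2.0, 4.0, 8.0, 16.0]))  # increasing sequence: 8.4
```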
We expect to discuss with the community about:
1) our approach and possible new ideas for managing such applications;
2) the needs in terms of application, scheduler and runtime to manage such applications, and how to implement them on HPC clusters (we would particularly like to discuss this with sysadmins and application developers);
3) future directions and collaboration opportunities.
The Melissa-DA framework, derived from Melissa, a tool for highly parallel massive sensitivity analysis, allows performing ensemble-based data assimilation with thousands of ensemble members at scale. Melissa-DA builds up a distributed high-performance computing application made of many small jobs that dynamically connect to a server job. All parts of the distributed application can abort independently, e.g. to be restarted from the last checkpoint using more or fewer compute resources to adapt to compute resource availability. Checkpointing is based on the Fault Tolerance Interface (FTI). The assimilation's update step is based on the Parallel Data Assimilation Framework (PDAF).
- Application of similar workflows to solve different problems (e.g. Inverse Problems) using huge ensembles and Monte Carlo Methods
Automatic Differentiation (AD) is an increasingly common method to obtain gradients of numerical programs, for use in machine learning, optimization, or uncertainty quantification. An emerging challenge for AD is that numerical programs are written in a growing number of parallel dialects, such as OpenMP, CUDA, OpenCL, Kokkos. It seems desirable to have a model, or intermediate representation, that would allow AD tools to reason about the differentiation of general parallel programs, so that new parallel dialects only require a new frontend, not a new AD tool. We would like to investigate intermediate languages that are general enough to capture the parallelization features commonly used in practice, while being restrictive enough to allow efficient differentiation and code generation.
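As a tiny illustration of why parallel constructs matter for AD (not tied to any particular tool or intermediate representation): a forward loop whose iterations read shared inputs independently turns, in reverse mode, into a loop that must accumulate into shared adjoints, which is exactly where atomics or reductions become necessary.

```python
import numpy as np

# Forward: each "thread" i computes y[i] = x[i] * x[(i+1) % n]  (independent reads)
def forward(x):
    n = len(x)
    return np.array([x[i] * x[(i + 1) % n] for i in range(n)])

# Reverse: propagating ybar back requires *accumulating* into xbar, because
# several forward iterations read the same x[j]; in OpenMP/CUDA this scatter-add
# needs atomics or a reduction, which an AD-friendly IR must be able to express.
def reverse(x, ybar):
    n = len(x)
    xbar = np.zeros_like(x)
    for i in range(n):
        xbar[i] += ybar[i] * x[(i + 1) % n]
        xbar[(i + 1) % n] += ybar[i] * x[i]
    return xbar

x = np.array([1.0, 2.0, 3.0])
print(reverse(x, np.ones(3)))   # [5. 4. 3.], the Jacobian-transpose times ones
```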
What abstractions (PGAS? Something else?) are most useful to express parallelism for the purpose of AD? How can we best implement AD on these abstractions? How can we turn the AD'ed intermediate representation back into e.g. efficient CUDA/OpenMP code? A collaboration with e.g. the JLESC participants working with OmpSs2, Argobots, etc, could be useful in answering these questions.
Solving time-dependent differential equations in parallel has been studied in many ways. The standard way of solving these equations is sequential in the time variable, but parallel in space variables. Therefore, better mathematical concepts are needed to fully exploit massively parallel architectures. For the numerical solution of time-dependent processes, time-parallel methods have opened new ways to overcome scaling limits.
In this talk, we first discuss how to form N coupled equations originating from a collocation problem, for N time intervals which are needed to reach the desired solution. Second, in the linear case, a diagonalizable preconditioning operator is used and standard Richardson iterations are performed. This allows us to independently solve decoupled problems on each processor of the same size as in the sequential approach. The goal is to reach a global solution on a longer time interval in no more than the number of iterations needed for one sequential propagation on a smaller interval. First prototype results are shown for the linear case. Also, a nonlinear approach is discussed.
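One common way to write the setup sketched above is the following; this is only a schematic formulation under assumed notation (the preconditioner is shown as an alpha-circulant approximation, which is one typical choice for making it diagonalizable across the time intervals), and the speakers' exact formulation may differ.

```latex
% Composite collocation problem over L time intervals (sketch):
% Q is the collocation matrix, A the spatial operator, H transfers the final
% value of one interval to the initial value of the next, E is the lower shift.
\begin{align*}
  \underbrace{\bigl(\mathbf{I} - \mathbf{I}_L \otimes \Delta t\, Q \otimes A
      - \mathbf{E} \otimes \mathbf{H}\bigr)}_{=:\,\mathbf{C}} \,\vec{u} &= \vec{u}_0, \\
  \intertext{preconditioned Richardson iteration with a diagonalizable approximation
  $\mathbf{C}_\alpha$ of $\mathbf{C}$ (e.g.\ replacing $\mathbf{E}$ by an
  $\alpha$-circulant matrix, so the system decouples across the $L$ intervals):}
  \vec{u}^{k+1} &= \vec{u}^{k} + \mathbf{C}_\alpha^{-1}\bigl(\vec{u}_0 - \mathbf{C}\vec{u}^{k}\bigr).
\end{align*}
```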
We are looking for both interesting applications of the method and new ideas to improve our preconditioners, in particular for nonlinear problems, where a solid and reliable mathematical foundation is still missing.
Within the HPC community, a lot of work has been done to optimize matrix-matrix multiplication, also known as GEMM. This function is well known as one of the most heavily optimized functions and is able to obtain close to peak performance. While there exist many implementations and libraries for matrix-matrix operations, the more general class of operations, tensor-tensor operations and in particular non-standard tensor-tensor operations, has not received as much attention yet. A few libraries cover the basic cases, but there exist tensor-tensor operations that are not covered by them. Analyzing the kernels in our simulation code led to identifying three properties that deviate from the basic cases: i) multiplying more than two tensors, ii) index manipulations, e.g. an index shift, iii) dealing with different tensor types, e.g. multiplying a symmetric tensor, of which only one half is stored, with a non-symmetric tensor. In this regard, there are several important open questions: how can we optimize these classes of operations by applying the lessons learned from optimizing GEMM, namely adjusting the loop order, blocking and packing for the different cache levels and registers, calling a highly optimized kernel in the innermost loop, and parallelizing the surrounding loops?
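As a small illustration of the "pack, then call an optimized kernel" lesson, consider a contraction with an index shift (property ii); the shapes and the shift below are arbitrary. The shift prevents a direct GEMM mapping, but a packing step (here `np.roll`) restores a standard contraction that an optimized kernel (here `np.tensordot`) can handle.

```python
import numpy as np

# Target: C[a, b] = sum_{i, j} A[a, i, j] * B[i, (j + s) % J, b]
A = np.random.rand(4, 6, 8)
B = np.random.rand(6, 8, 5)
s, J = 3, 8

Bp = np.roll(B, -s, axis=1)                         # packing: absorb the index shift
C = np.tensordot(A, Bp, axes=([1, 2], [0, 1]))      # maps to an optimized contraction

# reference implementation with naive loops
C_ref = np.zeros((4, 5))
for a in range(4):
    for b in range(5):
        for i in range(6):
            for j in range(J):
                C_ref[a, b] += A[a, i, j] * B[i, (j + s) % J, b]
assert np.allclose(C, C_ref)
```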
- General guidelines for optimizing tensor-tensor multiplications that cannot be mapped to available numerical libraries
- Exploiting architecture-specific parameters to tune the blocking of the algorithm
Quantum computing is computing using quantum-mechanical phenomena, such as superposition and entanglement. It holds the promise of being able to efficiently solve problems that classical computers practically cannot. The objective of quantum optimal control (QOC) is to devise and implement shapes of pulses of external fields, or sequences of such pulses, that achieve a given task in a quantum system in the best way possible. QOC requires the computation of adjoints using reverse mode automatic differentiation (AD). A naive implementation of reverse mode AD requires the storage of unitary matrices of size 2^q x 2^q, where q is the number of qubits. Current approaches to QOC do not exploit the fact that the conjugate transpose of a unitary matrix is also its inverse. We investigated strategies for reverse mode AD that exploit this fact and applied them to a QOC problem.
We have shown that the overall error in exploiting the reversibility of unitary matrices could be related to the error in computing a matrix exponential using the Padé approximation approach. We are currently implementing QOC within JAX, a high-performance machine learning framework. As part of this effort, we expect to exploit the reversibility of unitary matrices.
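A minimal numpy illustration of the reversibility being exploited: intermediate states of a forward sweep through a pulse sequence need not be stored, because they can be recovered in the reverse sweep by applying conjugate transposes (the error of this recomputation is what the Padé-based analysis above relates to). This is only a sketch, not the JAX implementation mentioned.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
q, steps, dt = 3, 5, 0.1
dim = 2 ** q

# random Hermitian generators -> unitaries U_k = expm(-i * dt * H_k)
Us = []
for _ in range(steps):
    M = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    H = (M + M.conj().T) / 2
    Us.append(expm(-1j * dt * H))

psi0 = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi0 /= np.linalg.norm(psi0)

# forward sweep: keep only the final state
psi = psi0
for U in Us:
    psi = U @ psi

# reverse sweep: recover the earlier states with U^dagger instead of storing them
recovered = psi
for U in reversed(Us):
    recovered = U.conj().T @ recovered

print(np.allclose(recovered, psi0))   # True up to round-off
```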
Shared Infrastructure for Source Transformation Automatic Differentiation
Over the past eight years of the operation of Blue Waters a vast amount of data has been collected for all aspects of the system. This data set includes system logs, IO profiles (Darshan), and performance counter data for all processors, memory, IO and networks. NCSA has now made this data set available for public download and use to enable research on the performance and resilience of large-scale systems. This presentation will describe the Blue Waters data set and give a few highlights on the reliability and availability characteristics of the system.
The quantity and breadth of data on all aspects of the operations of Blue Waters opens the door to investigations in areas of system resilience and tool development. NCSA is interested in collaborating on projects that can use the data set to develop tools for the automation of fault detection and mitigation and on tools that would automate the process of identifying problematic application behavior.
This talk introduces early results on the use of machine learning for the purpose of fine-tuning multi-level checkpointing, using VeloC as a reference checkpointing framework. The talk will first introduce VeloC, focusing on its configurable parameters (notably the checkpoint interval). Then, it will present a technique that uses partial simulation to find the optimal checkpoint intervals for specific configurations, while relying on machine learning to train on these configurations and infer the optimal checkpoint interval for related configurations.
Alternatives to optimize checkpointing intervals? Other parameters/fine-tuning opportunities based on machine learning?
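For reference, a classical first-order baseline for the checkpoint interval is the Young/Daly formula; the simulation-plus-learning approach above aims to go beyond this kind of closed-form estimate in realistic, multi-level settings.

```python
import math

def young_daly_interval(checkpoint_cost_s, mtbf_s):
    """First-order optimal checkpoint interval (Young/Daly): sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# example: 60 s to write a checkpoint, one failure every 24 h on average
print(young_daly_interval(60.0, 24 * 3600.0) / 60.0, "minutes")   # ~54 minutes
```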
The HPC machine landscape continuously evolves toward more performant and heterogeneous systems of specialized hardware. A new machine is expected to deliver better performance. However, efficiency, i.e. the achieved fraction of the system's performance upper bounds, should also be considered when evaluating a machine's ROI. Indeed, aside from shining in the HPL ranking, high-end HPC systems also have to run science on a variety of workloads. These workloads stress different parts of the machine, and expected gains should be set accordingly. In order to set expectations, it is necessary to characterize the bottlenecks of optimized applications and benchmark the system on these specific aspects. In 2014, the Cache Aware Roofline Model (CARM) set a baseline for representing application performance relative to HPC compute node performance. The bounds represented in the CARM are the memory hierarchy bandwidths and the floating-point unit throughput. In 2019, we augmented the model, taking locality into account. We also provided a finer characterization of achievable memory bandwidth under constraints such as an application's read/write mix or the fast/slow memory allocation mix. Lately, we started exploring applications' memory access patterns and matching them with the machine topology to evaluate the time an application needs to transfer its data from main memory to the compute units. Our goal is to be able to evaluate a tight upper bound on achievable memory bandwidth for a new hypothetical computing system and a known application.
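The basic roofline bound that CARM refines can be sketched as follows; the peak numbers, bandwidths, and arithmetic intensities here are made up for illustration.

```python
def roofline_bound(ai_flop_per_byte, peak_gflops, bw_gbytes_per_s):
    """Attainable performance (GFLOP/s) for a kernel of given arithmetic intensity."""
    return min(peak_gflops, ai_flop_per_byte * bw_gbytes_per_s)

# illustrative node: 3000 GFLOP/s peak, 200 GB/s DRAM, 1000 GB/s L3 bandwidth
for ai in [0.1, 1.0, 10.0, 100.0]:
    dram = roofline_bound(ai, 3000.0, 200.0)
    l3   = roofline_bound(ai, 3000.0, 1000.0)
    print(f"AI={ai:>5}: DRAM-bound {dram:7.1f} GFLOP/s, L3-bound {l3:7.1f} GFLOP/s")
```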
As our priorities have shifted, we have little time left to carry on this research. We are looking for interested collaborators to take ownership of the project.
The DGEMM function is a widely used implementation of the dense matrix product. While the asymptotic complexity of the algorithm only depends on the sizes of the matrices, we show that the performance is impacted in a statistically significant way by the content of the matrices. Our experiments show that this may be due to bit flips in the CPU causing an energy consumption overhead, which in turn forces the CPU to lower its frequency.
This effect has been observed on most of the recent Intel CPU generations.
This phenomenon has been noticed thanks to a meticulous experimental methodology. Our approach allows us to better study and understand the spatial and temporal variability of supercomputers, a key point when applications need to be executed at large scale.
Collaboration opportunity with Argonne National Laboratory.
Tuning the configuration of a scientific application is not a trivial task, even more so when the tool can have more than 100 parameters, ranging from algorithmic parameters (like the numeric error limit of the solver) to HPC-related parameters (like the number of processes to use). The literature has several approaches for the automatic tuning of workloads, from direct search methods to model-based optimization or genetic programming. State-of-the-art autotuners are based on iteratively evaluating jobs with different configurations, smartly choosing the samples to find good candidates fast. Most optimizers are based on black-box optimization processes which do not take any information from the workload (besides an app identifier or some other extra feature like the dataset size). Scientific workloads can take a considerable amount of time, making the optimization process, even for a small number of runs, far from ideal.
In this talk, we will present a new method to tune scientific engineering workloads that achieves good performance from a single run of the job. To achieve this, we mine the logs that are generated by the applications after a single run of the workload. Since the logs contain a lot of internal information from the execution, we can build a model that is trained on low-level features of the application, building from them a descriptor that is fed to learning algorithms. This allows our system to predict sensible configurations for a large number of parameters of unseen jobs, given that it has been trained with reasonable coverage of applications. Experiments show that the presented system can produce good configurations from a single run, making it more practical to deploy on HPC environments, achieving significant speedup with respect to the default configuration and gains over other techniques.
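A rough sketch of the kind of pipeline described (illustrative only; the real system's features, model, and prediction target are certainly richer): log-derived features plus the configuration used train a runtime regressor, which is then used to rank candidate configurations for an unseen job after one profiling run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# training set: (log-derived features of a run + configuration used) -> runtime
n_runs, n_log_feats, n_cfg = 500, 12, 3
X_logs = rng.random((n_runs, n_log_feats))
X_cfg  = rng.integers(1, 9, size=(n_runs, n_cfg)).astype(float)
runtime = X_cfg[:, 0] * 10 + X_logs[:, 0] * 5 + rng.normal(0, 0.5, n_runs)  # synthetic

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(np.hstack([X_logs, X_cfg]), runtime)

# unseen job: one profiling run gives its log features; rank candidate configs
new_logs = rng.random(n_log_feats)
candidates = rng.integers(1, 9, size=(50, n_cfg)).astype(float)
pred = model.predict(np.hstack([np.tile(new_logs, (50, 1)), candidates]))
print("suggested configuration:", candidates[np.argmin(pred)])
```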
The current results show good accuracy in the prediction of the execution time of the workloads for fixed given hardware. We can do transfer learning to other hardware for which we have benchmarks to extrapolate from, but how can we extend the work in a more general way? The next generation of systems (exascale) poses new complexity, with exponentially many possible new architectures (non-volatile memory, accelerators, etc.).
Although it is not the main focus of the proposed talk and poster, since most reservoir engineering applications in O&G use sparse matrices, we can also collaborate in the Numerical Methods project by sharing matrices from this specific domain and/or in discussions of the techniques used in our sparse parallel linear solvers.
Performance in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it is not clear which native hardware events are indexed by which event names, making it difficult for the performance analyst to understand how to measure specific events.
To alleviate this difficulty, the Counter Analysis Toolkit (CAT) aims to measure the occurrences of events through a series of benchmarks, classify particular events of interest via data analysis of the patterns of event occurrences, and allow its users to discover the high-level meaning of native events.
How can we automatically classify native events from architectures with fundamentally different specifications (e.g. differing numbers of cache levels and cache line sizes)?
What are robust classification models which can work based on a few data samples?
For a given architecture, there may be only a few event names which index a native event. But our goal is to find similarities between event occurrence patterns across architectures. How can we reconcile these two concepts? There is a need for models which can recognize nuanced patterns while being trained on a relatively small number of samples.
Run-to-run performance variations originating from hardware deficiencies result in severe execution inefficiencies for HPC applications on current leadership computer systems with increasing impact at larger scales. Specific actual cases will be summarised, along with potential steps for amelioration.
How to identify affected applications? How to identify (and replace) deficient hardware components in a timely fashion? How much inefficiency is acceptable for large-scale HPC? Collaboration may involve system monitoring and application performance reporting, as well as application strategies for amelioration.
In the dawn of exascale computing, such code metrics as arithmetic intensity are increasingly important for codes which heavily utilize numerical kernels. However, confusing native event names in modern architectures render certain bandwidth-related events difficult to identify. Additionally, the IBM POWER9 architecture in the Summit supercomputer not only natively supports single- and double-precision, but also quadruple-precision computation. Thus, there are multiple categories of arithmetic intensity.
The Uncore and Performance Co-Pilot (PCP) components in PAPI allow users to measure off-core events for architectures such as Intel Broadwell, Intel Skylake, and IBM POWER9. We leverage these components to explore the CPU's off-core traffic and rigorously define arithmetic intensity via native hardware events.
The HPC software landscape has never been more intricate, as the number of viable alternatives for writing HPC applications has risen in recent years. MPI + OpenMP (Fortran or C), Kokkos, Raja and the newer SYCL/DPC++ are all valid candidates. Fortunately, those high-level APIs target low-level back-ends that provide their own APIs. Understanding the interrelations between the high-level and the low-level programming models is of paramount importance if we are to understand, debug, project, and optimize application performance.
In this talk we will present a strategy for model-centric tracing of low level APIs and illustrate it with the case of the OpenCL API. We will show some of the benefits our approach allows, such as: in-depth profiling, lightweight monitoring, computing kernel extraction, or validation of the OpenCL state.
Applications of model-centric low-level tracing of HPC APIs are wide ranging. Collaborations could be encompassed with teams doing simulation, or runtime systems for HPC. Online platform and software monitoring is also a topic we could build collaboration on.
I will present the main research directions of my thesis on the interactions between task-based runtimes and the communication layer. Four general ideas will be explored during my thesis: optimisation of collective communications, improving the polling policy according to the activity of the runtime, anticipating future communications, and defining the communication patterns produced by task-based runtimes.
Looking for applications with specific communication patterns and for methods/tools to analyze performance.
Hierarchical parallelism is everywhere. To exploit the full range of parallel resources, different parallelization approaches are required. Intra-core resources (SIMD) and intra-node resources (threads) can be handled directly in modern languages like C++. However, for distributed resources no direct language support exists. Typically, third-party libraries like MPI or PGAS frameworks like HPX are required to exchange data. For compute-bound problems these approaches are well suited. However, if the runtime is dominated by overhead (communication, synchronization, latency), both approaches pose difficulties. Additionally, using MPI in a multi-threaded environment (with any thread sending or receiving) introduces additional overhead.
In this short talk we present the current state of our FMSolvr library in the light of hierarchical parallelism and its limitations.
- alternative approaches beyond MPI for low-latency inter-node CPU-CPU communication
- support for low-overhead multi-threaded communication
- support for CPU-GPU communication
- support for GPU-GPU communication
Hybrid parallel programming combines distributed-memory and shared-memory programming models to exploit inter- and intra-node parallelism, respectively. However, such models, like the current standards of MPI, GASPI, and OpenMP, are not explicitly designed to be combined. Application developers tend to simply serialize communication phases, which reduces the effective parallelism of applications and introduces artificial synchronization points among processes when switching phases.
The Task-Aware MPI (TAMPI) library improved the programmability and performance of hybrid applications by facilitating the taskification (with OpenMP and OmpSs-2) of blocking/non-blocking MPI communications. Following that idea, we present the Task-Aware GASPI (TAGASPI) library, which focuses on improving those same aspects when taskifying GASPI's one-sided operations.
In this way, app developers can easily taskify both computation and communication phases allowing the natural overlap of computation and communication tasks. Task dependencies declared over the real data structures guarantee the correct execution order of those tasks. Furthermore, with this approach, there is no need for extra synchronization points between different phases or abstract dependencies for serializing communication tasks.
We would like to open new collaborations with the objective of porting applications or mini-apps to leverage the TAGASPI library, or even TAMPI. Those applications would probably improve their programmability and performance, and we would have the opportunity to study new use cases for our libraries and models.
HPC aims at maximizing the performance of an application on a given architecture. The classical optimization workflow is based on identifying hot spots or bottlenecks and on optimizing them before making new performance evaluations. The natural design of HPC applications reflects this model, where a set of basic building blocks is identified for the performance penalties it incurs if not optimized. Most HPC applications use standard basic building blocks such as dense linear algebra kernels or FFTs embedded into libraries or frameworks.
The medium-term goal of our project is to provide a methodology that answers the question of how domain developers can productively achieve a high level of performance for complex software in the presence of code evolutions and architecture variations. A major scientific objective of the project is then to propose a programming model that enables flexible and maintainable HPC applications, making them robust to code evolutions and architecture variations, with feedback to the user.
The presentation will focus on our recent experiences motivating our project:
- efficient composition of BLAS subroutines on multi-GPU nodes (typically Nvidia DGX-1)
- tracing tools for OpenMP programs based on OMPT
We are going to present two challenging questions related to the runtime:
1. Can runtimes sustain a high level of performance in the presence of complex building-block compositions with thousands of potentially concurrent activities?
2. How to report performance results back from the execution to the high-level description of the application?
I will present the latest advances in OmpSs, OpenACC, use of heterogeneous memory systems and tasking in GPU carried out at the Accelerators and Communications for HPC team at BSC.
Seeking mainly users, but also research partnerships.
There is now a large and ever growing number of workflow systems, and we have lost hope in encouraging users not to continue developing more. Instead, we want to focus on building shared elements that can help us with our own systems, as well as the users of those systems and the developers of applications that will increasingly be used as workflow elements in simulation, analysis, search, optimization, and parameter study research campaigns.
As discussed briefly at JLESC 9 (in ST A1.1, Python-based Workflows & Software Sustainability), two of the common types of elements that workflow systems interact with are the end computing systems and the preexisting applications that the workflows wrap and call. Today, users of a workflow system have to find information about both the end points and the applications, they have to map that information to workflow-specific configuration formats, individually customize their workflow to use these configurations, and keep up with changes over time. Instead, we propose a registry of compute end points and applications, where an entry could be automatically brought into a workflow system.
This requires a pair of components:
1 - The registry itself and a means for adding and editing entries, potentially along with curation, or perhaps community curated, using WikiData
2 - A means to use entries for a given workflow system
Registry entries could be added by three different groups:
1 - Compute resource providers could enter their systems, and application developers could enter their applications
2 - Workflow system providers could enter systems and applications that they support, or we could collect published configurations and map them to our common schema
3 - Workflow developers could enter system and applications that they and their workflow users need
The initial work that we propose here is
1 - defining the schema for the registry (see the schema sketch after this list), and implementing it as a REST service
2 - building some test elements, and entering them manually
3 - building software for Parsl and PyCOMPs to import and use registry entries
We plan to do this initial work in the summer of 2020.
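To make the schema discussion concrete, here is one possible shape for registry entries, expressed as Python dataclasses; the field names are hypothetical and only meant to seed the discussion, not to prescribe the schema.

```python
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class ComputeEndpoint:
    """A compute resource a workflow system can submit to (hypothetical schema)."""
    name: str                      # e.g. "theta.alcf.anl.gov"
    scheduler: str                 # e.g. "cobalt", "slurm"
    launcher: str                  # e.g. "aprun", "srun"
    nodes: int
    cores_per_node: int
    queues: List[str] = field(default_factory=list)

@dataclass
class ApplicationEntry:
    """A preexisting application that workflows wrap and call (hypothetical schema)."""
    name: str
    version: str
    executable: str                # path or module/container reference
    inputs: Dict[str, str] = field(default_factory=dict)   # argument name -> type
    outputs: Dict[str, str] = field(default_factory=dict)
    compatible_endpoints: List[str] = field(default_factory=list)
```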
Open questions:
How to best represent computational end points and applications
How to best build a registry of these elements, and curate the information in it
How to bring this information into workflows specific to different workflow systems
We are now looking for any additional groups who would like to be involved in this project, potentially application developers whose applications are used in multiple workflow systems, owners/managers of HPC systems that are used by multiple workflow systems, information science/systems experts, and workflow system developers
We provide insights into the development of a fine-grained tasking for GPUs. The short talk starts out with an overview of our CUDA C++ tasking framework for fine-grained task parallelism on GPUs. After a short trip into the world of persistent threads, synchronization mechanisms and load balancing, we dig a bit deeper by addressing the performance-question. On that point, we examine several locking mechanisms provided by libcu++, Nvidia's freestanding Standard C++ library for GPUs. For each locking mechanism, we analyze its performance gains on an Nvidia Tesla V100 for a prototypic implementation of a task-based fast multipole method for molecular dynamics.
We are looking for collaborations in the areas of
- Message passing for multi-GPU tasking
- Message passing for tasking on heterogeneous CPU/GPU architectures
- Fine-grained tasking on FPGAs
The Message Passing Interface (MPI) plays a critical part in the parallel computing ecosystem, a driving force behind many of the high-performance computing (HPC) successes. To maintain its relevance to the user community, in particular to the growing HPC community at large, the MPI standard needs to identify and understand MPI users' status, concerns and expectations, and adapt accordingly. This survey has been conducted since February 2019 using online questionnaire frameworks, and more than 850 answers from 42 countries have been gathered at the time of this writing. There are some predecessors of this kind of survey; however, our survey is much larger in terms of geographic coverage and number of participants. As a result, it is possible to illustrate the current status of MPI users more precisely and accurately.
This survey is still open because we hope to increase the number of participants, especially from some countries, including the USA, Japan, Spain and China. I would like you to help and/or join us to get more participants, especially from those countries, and to write the final report.
MPI libraries have poor performance when MPI_THREAD_MULTIPLE is used. This is due to the coarse-grain locking used by MPI to ensure the atomic execution of critical operations, and to the lack of communication between MPI and the thread scheduler: the completion of a blocking communication does not inform the thread scheduler which thread has become runnable. We have developed LCI, a communication library that uses finer-grain locking and that interacts directly with a lightweight thread scheduler, such as Argobots. The library has been used to improve the performance of D-Galois, a graph analytics library, and we are currently porting PaRSEC to LCI.
Our goal is to port additional scientific libraries on top of LCI in order to validate our design and improve their performance. In particular, we plan on a collaboration with Ivo Kabadshow from Jülich in order to port his FMM library to LCI.
Many of today's HPC schedulers rely mostly on precise information about application runtimes (reservation-based scheduling). New applications (Big Data/ML) come with stochastic runtime profiles, so scheduling as usual is not an option anymore: the scheduling paradigm needs to change. We propose speculative scheduling: reservation strategies for jobs whose execution time follows a probability distribution. We have shown that, with simple implementations, it brings large benefits, both for (i) platform utilization and (ii) the average job response time.
- understanding the type/behavior of new applications coming to HPC systems
- understanding the needs both from a sysadmin perspective, and from a user perspective of future batch schedulers
Update on the SVE activities done in the Mont-Blanc 2020 project
I would like to collaborate on benchmarking HPC applications on the first SVE processor (i.e. Fujitsu's post-K)
As time progresses, the volume of data surrounding machine learning and AI methods continues to grow from training, testing, and validation datasets to the models themselves. As the volume grows, there are increasing challenges in transporting and storing the data. Lossy compression techniques present the opportunity to drastically reduce the volume of data while maintaining or even improving upon the quality of the decisions made by AI. This talk presents a survey of novel research examining the effect of lossy compression on AI decision making in a variety of domains including medical data science and physics. We examine the effects of data ordering, error bounds, compression methodologies to make general recommendations from these and other areas regarding how to effectively leverage lossy compression in AI using the common LibPressio interface.
TBD
CyberInfrastructure (CI) is increasingly harnessed to support the use of artificial intelligence (AI) techniques and applications to enable smart cities, power grids, computational simulations, real-time processing in big-data experiments (LIGO), supercomputing, and other important innovations. Computation is essential to developing and using the requisite analytics, while CI networks and data facilities are central to moving and managing the immense volumes of data. One might refer to this as CI for AI. Just as AI offers promise to revolutionize all aspects of our lives, the potential for improving the design, evolution, optimization, performance, operation, security and sustainability of CI is equally promising. We refer to AI applied to challenges of improving CyberInfrastructure as AI for CI.
A convergent research approach is required to explore the application of AI for CI, comprising experts from across the spectrum of cyberinfrastructure research, design, and deployment; computer and computational sciences; and systems design and engineering. To this end, we propose to create an AI for CI research activity to drive transformative exploration of CI using AI methods.
What AI methods are best for what purposes? What training is available? What are common research challenges identified across the exemplar CI families above all share some common themes: 1) the systems are so complex that only domain experts can make reasonable and repeatable optimization decisions and the complexity is growing exponentially, 2) decision making should be performed with minimal lag time for optimum performance but the consequences of those decisions, in terms of human welfare and/or monetary cost, can be high (e.g., loss of life and/or significant money), 3) processing telemetry data is a “big data” problem in itself with requirements for low latency time to solution in the presence of perhaps incomplete data, 4) individual data values are bounded but the high dimensionality and variability in aggregate makes full ensemble processing, within effective latency bounds, currently impossible, 5) social aspects of trusting the judgment of a machine must be overcome for significant deployment, 6) workload and resource scheduling, swarm coordination/management, resiliency, energy efficiency, ease of use by developers and end user and explainability, and 6) lack of labeled datasets as well as the high-cost of generating labeled datasets for training.
We will describe how the CANDLE/Supervisor system is used for configuration tuning using hyperparameter search and optimization. We will also illustrate its use in the analysis of underlying training data, including noise analysis and the search for important data to the application results. Performance and concurrency results will also be shown from running the system for cancer benchmark applications on the Summit system.
Multiple questions relevant to the Workflows and I/O projects will be raised including: How to manage storage for data pre-processing for model training; how to schedule various tasks in the training workflow; how to manage logical "working sets" of input and output data across runs efficiently.
A protein's structure determines its function. Different proteins have different structures and proteins in the same family share similar substructures and thus may share similar functions. Additionally, one protein may exhibit several structures (or conformations). Classifying different protein conformations can help solve problems such as determining the cause of diseases and designing drugs. X-ray Free Electron Laser (XFEL) beams are used to create diffraction patterns (images) that can reveal protein structure (and function).
The translation between diffraction patterns in the XFEL images and protein structures is non-trivial. In this talk we present the first steps into the design and assessment of a software framework for the classification of XFEL images. We call the framework XPC (XFEL-based Protein Classifier). We quantify the classification accuracy and performance of XPC for protein diffraction imaging datasets including realistic noise, different protein orientations and conformations. This project is collaborative research between UTK and RIKEN.
What is the suitability of the current framework for real-world cases (e.g., 3-D reconstruction of protein in general and ribosome in particular, annotation of images for their classification without human intervention) ?
What is the range of errors (i.e., distribution of orientation and/or conformation) tolerable for the scientists?
What are the costs for training and predictions?
Can we use systematic data from simulations to train the framework and use the trained models for classification/prediction of real experiment data successfully?
How can we augment the framework to also classify / predict protein structures (currently the framework captures orientations and conformations)?
This work introduces the notion of state preservation, which generalizes checkpointing into productive use cases. It focuses in particular on deep learning applications, which exhibit significantly different behavior at large scale compared with traditional HPC applications. This has important consequences for state preservation, which will be discussed by the talk, concluding with a series of early results.
How deep learning application can use model checkpoints and dependencies between them to build better training strategies? How can caching of training data reduce sampling overhead? How can nodes collaborate to build distributed checkpoints without involving the parallel file system?
Workflow systems promise scientists the ability to accelerate the scientific process by coordinating multiple computing tasks. Optimizing and controlling workflow are vital to this process, and failing to do so can significantly delay scientific discoveries and waste precious resources. Scientists often configure workflows manually with trial and error based on their expertise. In this talk, we will discuss how AI techniques can be used to automatically trigger changes in the workflow with the goal of improving performance. We hope that this talk will set up potential collaboration(s) towards this goal.
How AI techniques can be used to automatically trigger changes in the workflow with the goal of improving performance and resource utilization? We hope that our talk will set up potential collaboration(s) towards this goal.
Earth system models (ESMs) have increased spatial resolution to achieve more accurate solutions. As a consequence, the number of grid points increases dramatically, so an enormous amount of data is produced as simulation results. However, some ESMs use inefficient sequential I/O schemes that do not scale well when many parallel resources are used. This issue is typically addressed by adopting scalable parallel I/O solutions.
This work analyzes and improves the I/O process of the Integrated Forecasting System (IFS), one of the most important atmospheric models used in Europe. IFS can use two different output schemes: the MF I/O server and an inefficient sequential I/O scheme. The latter is the only scheme that can be used by non-ECMWF users. Additionally, these two output schemes originally attached in IFS produce the outputs in GRIB format, which is not the standard used by the climate community.
Therefore, it is presented an easy-to-use development that integrates an asynchronous parallel I/O server called the XML I/O Server (XIOS) into IFS. XIOS offers a series of features that are especially targeted for climate modelling: netCDF format data, online diagnostics and CMORized data.
oreover, an analysis is done to evaluate the computational performance of the IFS-XIOS integration, proving that the integration itself is adequate in terms of computational efficiency, but the XIOS performance is very dependant on the output size of the netCDF files. Therefore, the HDF5 compression is tested to try to reduce the size of the data so that both I/O time and storage footprint can be improved.
However, the default lossless compression filter of HDF5 does not provide a good trade-off between size reduction and computational cost. Thus, it is necessary to explore lossy compression filters that may allow reaching high compression ratios and enough compression speed to considerably reduce the I/O time while keeping high accuracy.
What types of lossy compression are more suitable for weather forecast and climate prediction? Can we use same compression for all variables?
In particular, we are interested in the SZ compressor from ANL. Do you think it is suitable for our needs? Is it compatible with parallel I/O?
Thus, might there be a potential collaboration between ANL and BSC?
The SZ compressor is already registered as a third-party filter of HDF5. We would like to explore with ANL if it is enough for XIOS, or we would need to develop a particular solution.
On HPC platforms, concurrent applications share the same file system. This can lead to conflicts, especially as applications become more and more data intensive, and I/O contention can become a performance bottleneck. Access to bandwidth can be split into two complementary yet distinct problems: the mapping problem and the scheduling problem. The mapping problem consists in selecting the set of applications that compete for the I/O resource. The scheduling problem then consists, given I/O requests on the same resource, in determining the order of those accesses that minimizes concurrency and thus the I/O time. The first part of our work takes into account the iterative behavior of HPC applications to design a model and to study list-scheduling heuristics for the scheduling problem. In a second part, we design a data-aware partitioning of the platform to address the mapping problem. To build a solution to this problem we use a continuous model, representing a steady state, in order to balance application bandwidth across the whole machine.
Future work may involve extending the model, studying the sensitivity of the policies, and characterizing the workloads on which our policy works best.
We may also try to move from a model with exclusive access to bandwidth to one with resource sharing.
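To make the scheduling side concrete, here is a deliberately simple illustration in C of the kind of list scheduling involved (it does not reproduce the heuristics studied in this work): if pending I/O phases obtain exclusive access to the shared bandwidth one after another, ordering them shortest first minimizes the average time at which applications can resume computing.

```c
#include <stdio.h>
#include <stdlib.h>

/* One pending exclusive I/O phase on the shared resource. */
typedef struct { int app_id; double io_time; } io_req_t;

static int by_shortest(const void *a, const void *b)
{
    double d = ((const io_req_t *)a)->io_time - ((const io_req_t *)b)->io_time;
    return (d > 0) - (d < 0);
}

int main(void)
{
    io_req_t reqs[] = { {0, 4.0}, {1, 1.0}, {2, 2.5} };   /* made-up phases */
    size_t n = sizeof reqs / sizeof reqs[0];

    qsort(reqs, n, sizeof reqs[0], by_shortest);  /* shortest-I/O-first order */

    double t = 0.0, sum_completion = 0.0;
    for (size_t i = 0; i < n; i++) {
        t += reqs[i].io_time;        /* exclusive access: phases run back to back */
        sum_completion += t;
        printf("app %d finishes its I/O at t = %.1f\n", reqs[i].app_id, t);
    }
    printf("average I/O completion time: %.2f\n", sum_completion / n);
    return 0;
}
```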
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging. This boils down to reconciling many, typically contradictory, application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum. In this paper we introduce a rigorous methodology for such a process and validate it through E2Clab. It is the first platform to support the complete analysis cycle of an application on the Computing Continuum: (i) the configuration of the experimental environment, libraries and frameworks; (ii) the mapping between the application parts and machines on the Edge, Fog and Cloud; (iii) the deployment of the application on the infrastructure; (iv) the automated execution; and (v) the gathering of experiment metrics. We illustrate its usage with a real-life application deployed on the Grid'5000 testbed, showing that our framework allows one to understand and improve performance, by correlating it to the parameter settings, the resource usage and the specifics of the underlying infrastructure. This work will be presented at IEEE Cluster 2020.
Future work and collaboration may involve:
(i) Enable built-in support for other large-scale experimental testbeds (besides Grid’5000) such as Chameleon and more.
(ii) Leverage E2Clab to build a benchmark suite for applications running on the Computing Continuum.
(iii) Help E2Clab evolve towards an emulation framework for applications in the Edge-to-Cloud Computing Continuum. After reviewing the literature, we observed that existing solutions (emulation and simulation tools) focus just on the Cloud, Fog, or Edge. E2Clab could be the first one to support the Edge-to-Cloud Continuum. Currently, E2Clab supports network emulation, and it could be extended to emulate common behaviors of applications in the Edge-to-Cloud Continuum, such as resource constrained devices, failures (network, services, etc.), and mobility.
(iv) Deploy more real-life use cases using E2Clab to perform large-scale experiments; and extend E2Clab's features (if needed) to support the new use case. Do you have any use cases to share?
Large-scale applications and workflows in scientific domains such as weather forecasting or cosmology have increasing I/O needs. However, storage has been designed for decades as a global shared resource, making I/O one of the main bottlenecks of HPC systems. To mitigate this congestion, new tiers (node-local SSDs, burst buffers, network-attached storage, and so on) have been added to recently deployed supercomputers, increasing their complexity. An open question is how to provision those resources to jobs. Indeed, it is common for HPC systems to provide dynamic and exclusive access to compute nodes through a batch scheduler, but little has been done for dynamically provisioning storage resources. In this talk, we present our preliminary work on scheduling techniques for storage resources.
- Scheduling algorithm minimizing I/O interference between 'storage jobs'
- We are also seeking access to raw network-attached storage (NVMeoF). Does anybody have that?
In this work, we are interested in processing "2nd generation" applications on HPC facilities.
The main property of those applications is a huge variability in both execution time and memory usage.
For example, such applications can be found in Neuroscience and ML. They can take a (deterministic) walltime of a few hours to a few days, depending on the input.
This variability becomes an issue when those applications are submitted to HPC platforms, which require good knowledge of the application walltime. Similarly, the variability in memory consumption is a limitation when using cloud facilities.
To execute such applications on HPC platforms, the usual way is to "guess" a reservation time based on logs of previous runs.
However, due to the variability in walltime, strategies such as requesting the maximum walltime observed so far, or periodic strategies, are inefficient and costly.
They either overestimate the walltime, leading to wasted resources, or underestimate it, leading to wasted computation (the job is killed before it terminates).
In our work, we propose to model this decision process and provide an optimal "sequence of reservations" given a probability distribution of the job's execution time (a simplified version of the underlying cost model is sketched at the end of this item).
We also extend this approach by including resilience in some well-chosen reservations.
We validate the performance of our strategies using both simulation and real experiments on Neuroscience applications developed at Vanderbilt University.
In this talk, we will present our approach to this problem, as well as some results and future work.
We expect to discuss with the community:
1) our approach and possible new ideas for managing such applications;
2) the needs in terms of applications, schedulers and runtimes to manage such applications, and how to implement this on HPC clusters (we would like to discuss this with system administrators and application developers);
3) future directions and collaboration opportunities.
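As announced above, here is a simplified version of the cost model behind such reservation sequences, written in C under assumptions chosen only for illustration (discrete walltime distribution, cost counted as total reserved time, no checkpointing, waiting or restart overhead): a job whose actual walltime exceeds the current reservation is killed and resubmitted with the next, larger one.

```c
#include <stdio.h>

/* Expected total reserved time for an increasing reservation sequence t[],
   given a discrete walltime distribution (walltime w[j] with probability p[j]).
   The job is killed and resubmitted until a reservation t[i] >= w[j]. */
static double expected_cost(const double *t, int nt,
                            const double *w, const double *p, int nw)
{
    double cost = 0.0;
    for (int j = 0; j < nw; j++) {
        double paid = 0.0;
        for (int i = 0; i < nt; i++) {
            paid += t[i];              /* this reservation is paid for */
            if (t[i] >= w[j]) break;   /* the job finishes within it   */
        }
        cost += p[j] * paid;
    }
    return cost;
}

int main(void)
{
    /* Hypothetical walltime distribution (hours) and two candidate strategies. */
    double w[] = { 2.0, 6.0, 12.0 }, p[] = { 0.5, 0.3, 0.2 };
    double greedy[]   = { 12.0 };            /* always ask for the maximum     */
    double adaptive[] = { 2.0, 6.0, 12.0 };  /* grow the request on each retry */

    printf("max-walltime strategy: %.2f h reserved on average\n",
           expected_cost(greedy, 1, w, p, 3));
    printf("sequence strategy:     %.2f h reserved on average\n",
           expected_cost(adaptive, 3, w, p, 3));
    return 0;
}
```

Even this toy example shows why a well-chosen increasing sequence can beat always requesting the maximum observed walltime.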
The Melissa-DA framework, inherited from Melissa, a tool for highly parallel massive sensitivity analysis, allows one to perform ensemble-based data assimilation with thousands of ensemble members at scale. Melissa-DA builds a distributed high-performance computing application made of many small jobs that dynamically connect to a server job. All parts of the distributed application can abort independently, e.g. to be restarted from the last checkpoint using more or fewer compute resources, in order to adapt to compute resource availability. Checkpointing is based on the Fault Tolerance Interface (FTI). The assimilation's update step is based on the Parallel Data Assimilation Framework (PDAF).
- Application of similar workflows to solve different problems (e.g. Inverse Problems) using huge ensembles and Monte Carlo Methods
Automatic Differentiation (AD) is an increasingly common method to obtain gradients of numerical programs, for use in machine learning, optimization, or uncertainty quantification. An emerging challenge for AD is that numerical programs are written in a growing number of parallel dialects, such as OpenMP, CUDA, OpenCL, Kokkos. It seems desirable to have a model, or intermediate representation, that would allow AD tools to reason about the differentiation of general parallel programs, so that new parallel dialects only require a new frontend, not a new AD tool. We would like to investigate intermediate languages that are general enough to capture commonly used parallelization features used in practice, while being restrictive enough to allow efficient differentiation and code generation.
What abstractions (PGAS? Something else?) are most useful to express parallelism for the purpose of AD? How can we best implement AD on these abstractions? How can we turn the AD'ed intermediate representation back into e.g. efficient CUDA/OpenMP code? A collaboration with e.g. the JLESC participants working with OmpSs2, Argobots, etc, could be useful in answering these questions.
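As a tiny, self-contained illustration of why parallel constructs need care in reverse mode (a generic example, not tied to any particular AD tool or to the intermediate representation discussed above): a parallel read of a shared scalar in the primal becomes a concurrent accumulation into its adjoint in the reverse sweep, which must be protected, here with an OpenMP atomic, while the adjoint of the private outputs stays embarrassingly parallel.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N], xb[N], yb[N];
    double c = 3.0, cb = 0.0;             /* shared scalar and its adjoint */

    for (int i = 0; i < N; i++) { x[i] = 1.0 / (i + 1); yb[i] = 1.0; }

    /* Primal: every iteration reads the shared scalar c. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = c * x[i] * x[i];

    /* Reverse sweep: the read of c turns into an accumulation into cb,
       which races unless it is made atomic (or turned into a reduction). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        xb[i] += 2.0 * c * x[i] * yb[i];  /* private output: no conflict */
        #pragma omp atomic
        cb += x[i] * x[i] * yb[i];        /* shared adjoint: needs protection */
    }

    printf("cb = %f\n", cb);
    return 0;
}
```

Recognizing that the accumulation into cb is really a reduction, and generating the corresponding efficient code, is exactly the kind of transformation a parallelism-aware intermediate representation should enable.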
Solving time-dependent differential equations in parallel has been studied in many ways. The standard way of solving these equations is sequential in the time variable, but parallel in space variables. Therefore, better mathematical concepts are needed to fully exploit massively parallel architectures. For the numerical solution of time-dependent processes, time-parallel methods have opened new ways to overcome scaling limits.
In this talk, we first discuss how to form N coupled equations originating from a collocation problem, for N time intervals which are needed to reach the desired solution. Second, in the linear case, a diagonalizable preconditioning operator is used and standard Richardson iterations are performed. This allows us to independently solve decoupled problems on each processor of the same size as in the sequential approach. The goal is to reach a global solution on a longer time interval in no more than the number of iterations needed for one sequential propagation on a smaller interval. First prototype results are shown for the linear case. Also, a nonlinear approach is discussed.
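To fix notation (a simplified sketch of the linear setting, with the details of the preconditioner left out), the iteration in question is a preconditioned Richardson scheme on the composite collocation system over all N time intervals,

\[ u^{k+1} \;=\; u^{k} \;+\; P^{-1}\bigl(b - A\,u^{k}\bigr), \]

where A couples the N intervals and P is chosen diagonalizable, so that applying P^{-1} decouples into N independent subproblems, one per processor, each of the same size as a single sequential solve.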
We are looking for both interesting applications of the method and new ideas to improve our preconditioners, in particular for nonlinear problems, where a solid and reliable mathematical foundation is still missing.
Within the HPC community a lot of work has been done to optimize matrix-matrix multiplication, also known as GEMM. This function is well known as the most optimized one and is able to obtain close-to-peak performance. While many implementations and libraries exist for matrix-matrix operations, the more general class of tensor-tensor operations, and in particular non-standard tensor-tensor operations, has not received as much attention yet. While a few libraries cover the basic cases, there are tensor-tensor operations that are not covered by them. Analyzing the kernels in our simulation code led to identifying three properties that differ from the basic cases: i) multiplying more than two tensors, ii) index manipulations, e.g. an index shift, iii) dealing with different tensor types, e.g. multiplying a symmetric tensor, of which only one half is stored, with a non-symmetric tensor. In this regard, there are several important open questions: how can we optimize these classes of operations by applying the lessons learned from optimizing GEMM, i.e. adjusting the loop order, blocking and packing for the different cache levels and registers, calling a highly optimized kernel in the innermost loop, and parallelizing the surrounding loops (a generic blocked skeleton is sketched after the list below)?
- General guidelines for optimizing tensor-tensor multiplications that cannot be mapped to available numerical libraries
- Exploiting architecture-specific parameters to tune the blocking of the algorithm
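As announced above, a minimal cache-blocking skeleton in C (a generic illustration, not code from the simulation kernels in question); the structure of loop reordering, blocking for the cache levels and a small inner tile, which a tuned implementation would replace by packed buffers and an architecture-specific SIMD micro-kernel, is what would have to be generalized to the three tensor-tensor cases listed above.

```c
#include <stddef.h>

#define BM 64
#define BN 64
#define BK 64

/* C[m][n] += A[m][k] * B[k][n], row-major, blocked for the cache hierarchy.
   A highly tuned version would pack the blocks and call an architecture-
   specific SIMD micro-kernel in the innermost loops. */
void gemm_blocked(size_t M, size_t N, size_t K,
                  const double *A, const double *B, double *C)
{
    for (size_t i0 = 0; i0 < M; i0 += BM)
        for (size_t k0 = 0; k0 < K; k0 += BK)
            for (size_t j0 = 0; j0 < N; j0 += BN)
                /* micro-tile: small enough to stay in cache/registers */
                for (size_t i = i0; i < i0 + BM && i < M; i++)
                    for (size_t k = k0; k < k0 + BK && k < K; k++) {
                        double a = A[i * K + k];
                        for (size_t j = j0; j < j0 + BN && j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```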
Quantum computing is computing using quantum-mechanical phenomena, such as superposition and entanglement. It holds the promise of being able to efficiently solve problems that classical computers practically cannot. The objective of quantum optimal control (QOC) is to devise and implement shapes of pulses of external fields, or sequences of such pulses, that accomplish a given task in a quantum system in the best way possible. QOC requires the computation of adjoints using reverse-mode automatic differentiation (AD). A naive implementation of reverse-mode AD requires the storage of unitary matrices of size 2^q × 2^q, where q is the number of qubits. Current approaches to QOC do not exploit the fact that the conjugate transpose of a unitary matrix is also its inverse. We investigated strategies for reverse-mode AD that exploit this fact and applied them to a QOC problem.
We have shown that the overall error in exploiting the reversibility of unitary matrices could be related to the error in computing a matrix exponential using the Padé approximation approach. We are currently implementing QOC within JAX, a high-performance machine learning framework. As part of this effort, we expect to exploit the reversibility of unitary matrices.
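For reference, the property being exploited can be summarized compactly (a standard identity, stated here for illustration): if the state is propagated as

\[ \psi_k = U_k\,\psi_{k-1}, \qquad U_k^{-1} = U_k^{\dagger}, \]

then the intermediate states needed during the reverse sweep can be recomputed backwards as \( \psi_{k-1} = U_k^{\dagger}\,\psi_k \) instead of being stored, and the adjoint state propagates as \( \bar{\psi}_{k-1} = U_k^{\dagger}\,\bar{\psi}_k \). The price is the recomputation error, which is what relates the overall error to the accuracy of the Padé-based matrix exponential mentioned above.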
Shared Infrastructure for Source Transformation Automatic Differentiation
Over the past eight years of the operation of Blue Waters a vast amount of data has been collected for all aspects of the system. This data set includes system logs, IO profiles (Darshan), and performance counter data for all processors, memory, IO and networks. NCSA has now made this data set available for public download and use to enable research on the performance and resilience of large-scale systems. This presentation will describe the Blue Waters data set and give a few highlights on the reliability and availability characteristics of the system.
The quantity and breadth of data on all aspects of the operations of Blue Waters opens the door to investigations in areas of system resilience and tool development. NCSA is interested in collaborating on projects that can use the data set to develop tools for the automation of fault detection and mitigation and on tools that would automate the process of identifying problematic application behavior.
This talk introduces early results on the use of machine learning for the purpose of fine-tuning multi-level checkpointing, using VeloC as a reference checkpointing framework. The talk will first introduce VeloC, with emphasis on its configurable parameters (notably the checkpoint interval). Then, it will present a technique that uses partial simulation to find the optimal checkpoint intervals for specific configurations, while relying on machine learning to train on these configurations and infer the optimal checkpoint interval for related configurations.
Alternatives to optimize checkpointing intervals? Other parameters/fine-tuning opportunities based on machine learning?
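As a point of reference for this discussion (not part of the talk itself), the classical first-order Young/Daly approximation for a single checkpoint level gives the optimal checkpoint period as

\[ W_{\mathrm{opt}} \approx \sqrt{2\,C\,\mu}, \]

where C is the checkpoint cost and \mu the platform MTBF. Multi-level checkpointing, as targeted by VeloC, is precisely the setting where such closed forms stop being sufficient and where simulation combined with learning becomes attractive.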
The HPC machine landscape continuously evolves toward more performant and heterogeneous systems built from specialized hardware. A new machine is expected to deliver better performance. However, efficiency, i.e. the achieved fraction of the system's performance upper bounds, should also be considered when evaluating the machine's return on investment. Indeed, aside from shining in the HPL ranking, high-end HPC systems will also have to run science on a variety of workloads. These workloads stress different parts of the machine, and expected gains should be set accordingly. In order to set expectations, it is necessary to characterize the bottlenecks of optimized applications and to benchmark the system on these specific aspects. In 2014, the Cache Aware Roofline Model (CARM) set a baseline for representing application performance relative to the performance of HPC compute nodes. The bounds represented in the CARM were the memory hierarchy bandwidths and the floating-point unit throughput. In 2019, we augmented the model to take locality into account. We also provided a finer characterization of achievable memory bandwidth under constraints such as the application's read/write mix or the fast/slow memory allocation mix. Lately, we have started exploring applications' memory access patterns and matching them with the machine topology to evaluate the time an application needs to transfer its data from main memory to the compute units. Our goal is to be able to evaluate a tight upper bound on achievable memory bandwidth for a known application on a new, hypothetical computing system.
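For readers less familiar with roofline-style models, the baseline bound such models start from can be written, in its simplest single-level form, as

\[ P(I) \;=\; \min\bigl(P_{\mathrm{peak}},\; I \times B\bigr), \]

where I is the arithmetic intensity of the application, B the bandwidth of the memory level considered and P_peak the compute throughput. The refinements described above essentially replace the single B by bandwidths that depend on locality, read/write mix and data placement.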
As our priorities have shifted, we have only little time left to carry on this research. We are looking for interested collaborators to take ownership of the project.
The DGEMM function is a widely used implementation of the dense matrix product. While the asymptotic complexity of the algorithm only depends on the sizes of the matrices, we show that the performance is impacted in a statistically significant way by the content of the matrices. Our experiments show that this may be due to bit flips in the CPU causing an energy consumption overhead, which in turn forces the CPU to lower its frequency.
This effect has been observed on most of the recent Intel CPU generations.
This phenomenon has been noticed thanks to a meticulous experimental methodology. Our approach allows us to better study and understand the spatial and temporal variability of supercomputers, a key point when applications need to be executed at large scale.
Collaboration opportunity with Argonne National Laboratory.
Tuning the configuration of a scientific application is not a trivial task, even more so when the tool can have more than 100 parameters, ranging from algorithmic parameters (like the numeric error limit of the solver) to HPC-related parameters (like the number of processes to use). The literature offers several approaches for the automatic tuning of workloads, from direct search methods to model-based optimization or genetic programming. State-of-the-art autotuners are based on iteratively evaluating jobs with different configurations, smartly choosing the samples in order to find good candidates fast. Most optimizers are based on black-box optimization processes which do not take any information from the workload (besides an application identifier or some other extra feature like the dataset size). Scientific workloads can take a considerable amount of time, making such an optimization process, even for a small number of runs, impractical.
In this talk, we will present a new method to tune scientific and engineering workloads that achieves good performance from a single run of the job. To achieve this, we mine the logs that are generated by the application after a single run of the workload. Since the logs contain a lot of internal information about the execution, we can build a model that is trained on low-level features of the application, from which a descriptor is built and fed to the learning algorithms. This allows our system to predict sensible configurations for a large number of parameters of unseen jobs, given that it has been trained with reasonable coverage of applications. Experiments show that the presented system can produce good configurations from a single run, making it practical to deploy in HPC environments and achieving significant speedup with respect to the default configuration, as well as gains over other techniques.
The current results show good accuracy in the prediction of the execution time of the workloads for fixed given hardware. We can do transfer learning to other hardware for which we have benchmarks to extrapolate from, but how can we extend the work in a more general way? The next generation of (exascale) systems poses new complexity, with exponentially many possible new architectures (non-volatile memory, accelerators, etc.).
Although it is not the main focus of the proposed talk and poster, since most reservoir engineering applications in O&G use sparse matrices, we can collaborate in the Numerical Methods project by sharing matrices from this specific domain and/or in discussions of the techniques used in our sparse parallel linear solvers.
Performance in different CPU architectures can be monitored by reading the occurrences of various hardware events. However, from architecture to architecture, it is not clear which native hardware events are indexed by which event names, making it difficult for the performance analyst to understand how to measure specific events.
To alleviate this difficulty, the Counter Analysis Toolkit (CAT) aims to measure the occurrences of events through a series of benchmarks, classify particular events of interest via data analysis of the patterns of event occurrences, and allow its users to discover the high-level meaning of native events.
How can we automatically classify native events from architectures with fundamentally different specifications (e.g. differing numbers of cache levels and cache line sizes)?
What are robust classification models which can work based on a few data samples?
For a given architecture, there may be only a few event names which index a native event. But our goal is to find similarities between event occurrence patterns across architectures. How can we reconcile these two concepts? There is a need for models which can recognize nuanced patterns while being trained on a relatively small number of samples.
Run-to-run performance variations originating from hardware deficiencies result in severe execution inefficiencies for HPC applications on current leadership computer systems with increasing impact at larger scales. Specific actual cases will be summarised, along with potential steps for amelioration.
How to identify affected applications? How to identify (and replace) deficient hardware components in a timely fashion? How much inefficiency is acceptable for large-scale HPC? Collaboration may involve system monitoring and application performance reporting, as well as application strategies for amelioration.
At the dawn of exascale computing, code metrics such as arithmetic intensity are increasingly important for codes which heavily utilize numerical kernels. However, confusing native event names in modern architectures make certain bandwidth-related events difficult to identify. Additionally, the IBM POWER9 architecture in the Summit supercomputer natively supports not only single- and double-precision but also quadruple-precision computation. Thus, there are multiple categories of arithmetic intensity.
The Uncore and Performance Co-Pilot (PCP) components in PAPI allow users to measure off-core events on architectures such as Intel Broadwell, Intel Skylake, and IBM POWER9. We leverage these components to explore the CPU's off-core traffic and rigorously define arithmetic intensity via native hardware events.
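One way to write the quantity in question, with the precision split made explicit (notation introduced here purely for illustration), is

\[ I \;=\; \frac{F_{\mathrm{sp}} + F_{\mathrm{dp}} + F_{\mathrm{qp}}}{B_{\mathrm{read}} + B_{\mathrm{write}}}, \]

where the numerator counts single-, double- and quadruple-precision floating-point operations from core events and the denominator counts the off-core read and write traffic in bytes obtained through the Uncore/PCP components mentioned above.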
The HPC software landscape has never been more intricate, as the number of viable alternatives for writing HPC applications has risen in recent years. MPI + OpenMP (Fortran or C), Kokkos, RAJA and the newer SYCL/DPC++ are all valid candidates. Fortunately, those high-level APIs target low-level back-ends that provide their own APIs. Understanding the interrelations between the high-level and low-level programming models is of paramount importance if we are to understand, debug, project, and optimize application performance.
In this talk we will present a strategy for model-centric tracing of low level APIs and illustrate it with the case of the OpenCL API. We will show some of the benefits our approach allows, such as: in-depth profiling, lightweight monitoring, computing kernel extraction, or validation of the OpenCL state.
Applications of model-centric low-level tracing of HPC APIs are wide ranging. Collaborations could be encompassed with teams doing simulation, or runtime systems for HPC. Online platform and software monitoring is also a topic we could build collaboration on.
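As one concrete, deliberately simplistic way to prototype this kind of API-level instrumentation, the sketch below intercepts a single OpenCL entry point with an LD_PRELOAD interposer and forwards to the real implementation. It illustrates the general interposition technique only; it is not the tracing mechanism of the tools presented in the talk, which add model awareness, lightweight monitoring and state validation on top.

```c
/* Build: gcc -shared -fPIC trace.c -o libtrace.so -ldl
   Run:   LD_PRELOAD=./libtrace.so ./opencl_app                     */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

typedef cl_int (*enqueue_fn)(cl_command_queue, cl_kernel, cl_uint,
                             const size_t *, const size_t *, const size_t *,
                             cl_uint, const cl_event *, cl_event *);

/* Same signature as the real entry point: log the call, then forward. */
cl_int clEnqueueNDRangeKernel(cl_command_queue queue, cl_kernel kernel,
                              cl_uint work_dim,
                              const size_t *global_work_offset,
                              const size_t *global_work_size,
                              const size_t *local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event)
{
    static enqueue_fn real = NULL;
    if (!real)
        real = (enqueue_fn)dlsym(RTLD_NEXT, "clEnqueueNDRangeKernel");

    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    fprintf(stderr, "[%ld.%09ld] clEnqueueNDRangeKernel(work_dim=%u)\n",
            (long)ts.tv_sec, ts.tv_nsec, work_dim);

    return real(queue, kernel, work_dim, global_work_offset, global_work_size,
                local_work_size, num_events_in_wait_list,
                event_wait_list, event);
}
```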
I will present the main research directions of my thesis on the interactions between task-based runtimes and the communication layer. Four general ideas will be explored during my thesis: optimisation of collective communications, improving the polling policy according to the activity of the runtime, anticipating future communications, and defining the communication patterns produced by task-based runtimes.
Looking for applications with specific communication patterns, and for methods/tools to analyze performance.
Hierarchical parallelism is everywhere. To exploit the full range of parallel resources, different parallelization approaches are required. Intra-core resources (SIMD) and intra-node resources (threads) can be handled directly in modern languages like C++. However, for distributed resources no direct language support exists. Typically, third party libraries like MPI or PGAS frameworks like HPX are required to exchange data. For compute-bound problems these approaches are well suited. However, if the runtime is dominated by overhead (communication, synchronization, latency) both approaches pose difficulties. Additionally, using MPI in a multi-threaded environment (with any thread sending or receiving) does introduce additional overhead.
In this short talk we present the current state of our FMSolvr library in the light of hierarchical parallelism and its limitations.
- alternative approaches beyond MPI for low-latency inter-node CPU-CPU communication
- support for low-overhead multi-threaded communication
- support for CPU-GPU communication
- support for GPU-GPU communication
Hybrid parallel programming combines distributed-memory and shared-memory programming models to exploit inter- and intra-node parallelism, respectively. However, such models, like the current standards of MPI, GASPI, and OpenMP, are not explicitly designed to be combined. Application developers tend to simply serialize communication phases, which reduces the effective parallelism of applications and introduces artificial synchronization points among processes when switching phases.
The Task-Aware MPI (TAMPI) library improved the programmability and performance of hybrid applications by facilitating the taskification (with OpenMP and OmpSs-2) of blocking/non-blocking MPI communications. Following that idea, we present the Task-Aware GASPI (TAGASPI) library, which focuses on improving those same aspects when taskifying GASPI's one-sided operations.
In this way, application developers can easily taskify both computation and communication phases, allowing the natural overlap of computation and communication tasks. Task dependencies declared over the real data structures guarantee the correct execution order of those tasks. Furthermore, with this approach, there is no need for extra synchronization points between different phases or for abstract dependencies that serialize communication tasks.
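To make the taskification pattern concrete, here is a minimal generic sketch with OpenMP tasks and plain MPI (illustration only; it shows neither the TAMPI nor the TAGASPI API, and the function and buffer names are invented): dependencies declared on the user's buffers order the communication and computation tasks, so no phase-level barrier is needed. With plain MPI the blocking receive still occupies its worker thread; letting such a task pause and resume on completion is precisely what the task-aware libraries add.

```c
#include <mpi.h>
#include <omp.h>

/* One halo-exchange step expressed as tasks. Assumes MPI was initialized
   with MPI_THREAD_MULTIPLE and that this function is called from inside
   a parallel region (e.g. from a single construct or another task). */
void halo_step(double *u, double *halo, int n, int left, int right)
{
    #pragma omp task depend(in: u[0:n])      /* send my boundary to the right */
    MPI_Send(u, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);

    #pragma omp task depend(out: halo[0:n])  /* receive the left neighbour's boundary */
    MPI_Recv(halo, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    #pragma omp task depend(in: halo[0:n]) depend(inout: u[0:n])  /* local update */
    for (int i = 0; i < n; i++)
        u[i] += 0.5 * (halo[i] - u[i]);
}
```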
We would like to open new collaborations with the objective of porting applications or mini-apps to leverage the TAGASPI library, or even TAMPI. Those applications would probably improve their programmability and performance, and we would have the opportunity to study new use cases for our libraries and models.
HPC aims at maximizing the performance of an application on a given architecture. The classical optimization workflow is based on identifying hot spots or bottlenecks and optimizing them before making new performance evaluations. The natural design of HPC applications reflects this model, where a set of basic building blocks is identified because of the performance penalties they incur if left unoptimized. Most HPC applications use standard basic building blocks, such as dense linear algebra kernels or FFTs, embedded in libraries or frameworks.
The medium-term goal of our project is to provide a methodology that answers the question of how domain developers can productively achieve a high level of performance for complex software in the presence of code evolutions and architecture variations. A major scientific objective of the project is then to propose a programming model that enables flexible and maintainable HPC applications, making them robust to code evolutions and architecture variations, with feedback to the user.
The presentation will focus on our recent experiences motivating the project:
- efficient composition of BLAS subroutines on multi-GPUs node (typically Nvidia DGX-1)
- tracing tools for OpenMP programs based on OMPT
We are going to present two challenging questions related to the runtime:
1. Can runtimes sustain a high level of performance in the presence of complex building-block composition with thousands of potential concurrent activities?
2. How can performance be reported back from the execution to the high-level description of the application?
I will present the latest advances in OmpSs, OpenACC, use of heterogeneous memory systems and tasking in GPU carried out at the Accelerators and Communications for HPC team at BSC.
We are mainly seeking users, but also research partnerships.
There is now a large and ever growing number of workflow systems, and we have lost hope in encouraging users not to continue developing more. Instead, we want to focus on building shared elements that can help us with our own systems, as well as the users of those systems and the developers of applications that will increasingly be used as workflow elements in simulation, analysis, search, optimization, and parameter study research campaigns.
As discussed briefly at JLESC 9 (in ST A1.1. Python-based Workflows & Software Sustainability), two of the common types of elements that workflow systems interact with are the end computing systems and the preexisting applications that the workflows wrap and call. Today, users of a workflow system have to find information about both the end points and the applications; they have to map that information to workflow-specific configuration formats, individually customize their workflow to use these configurations, and keep up with changes over time. Instead, we propose a registry of compute end points and applications, where an entry could be automatically brought into a workflow system.
This requires a pair of components:
1 - The registry itself and a means for adding and editing entries, potentially along with curation, or perhaps community curated, using WikiData
2 - A means to use entries for a given workflow system
Registry entries could be added by three different groups:
1 - Compute resource providers could enter their systems, and application developers could enter their applications
2 - Workflow system providers could enter systems and applications that they support, or we could collect published configurations and map them to our common schema
3 - Workflow developers could enter system and applications that they and their workflow users need
The initial work that we propose here is
1 - defining the schema for the registry, and implementing it as a REST service
2 - building some test elements, and entering them manually
3 - building software for Parsl and PyCOMPs to import and use registry entries
We plan to do this initial work in the summer of 2020.
Open questions:
How to best represent computational end points and applications
How to best build a registry of these elements, and curate the information in it
How to bring this information into workflows specific to different workflow systems
We are now looking for any additional groups who would like to be involved in this project, potentially application developers whose applications are used in multiple workflow systems, owners/managers of HPC systems that are used by multiple workflow systems, information science/systems experts, and workflow system developers
We provide insights into the development of fine-grained tasking for GPUs. The short talk starts out with an overview of our CUDA C++ tasking framework for fine-grained task parallelism on GPUs. After a short trip into the world of persistent threads, synchronization mechanisms and load balancing, we dig a bit deeper by addressing the performance question. At that point, we examine several locking mechanisms provided by libcu++, Nvidia's freestanding standard C++ library for GPUs. For each locking mechanism, we analyze its performance gains on an Nvidia Tesla V100 for a prototypical implementation of a task-based fast multipole method for molecular dynamics.
We are looking for collaborations in the areas of
- Message passing for multi-GPU tasking
- Message passing for tasking on heterogeneous CPU/GPU architectures
- Fine-grained tasking on FPGAs
The Message Passing Interface (MPI) plays a critical part in the parallel computing ecosystem and is a driving force behind many high-performance computing (HPC) successes. To maintain its relevance to the user community, in particular to the growing HPC community at large, the MPI standard needs to identify and understand MPI users' status, concerns and expectations, and adapt accordingly. This survey has been conducted since February 2019 using online questionnaire frameworks, and more than 850 answers from 42 countries have been gathered at the time of this writing. There are some predecessors of this kind of survey; however, ours is much larger in terms of both geographic coverage and the number of participants. As a result, it is possible to illustrate the current status of MPI users more precisely and accurately.
The survey is still open because we hope to increase the number of participants, especially from some countries, including the USA, Japan, Spain and China. We would like you to help and/or join us to attract more participants, especially from those countries, and to write the final report.
MPI libraries have poor performance when MPI_THREAD_MULTIPLE is used. This is due to the coarse-grain locking that MPI uses to ensure the atomic execution of critical operations, and to the lack of communication between MPI and the thread scheduler: the completion of a blocking communication does not inform the thread scheduler which thread has become runnable. We have developed LCI, a communication library that uses finer-grain locking and interacts directly with a lightweight thread scheduler, such as Argobots. The library has been used to improve the performance of D-Galois, a graph analytics library, and we are currently porting PaRSEC to LCI.
Our goal is to port additional scientific libraries on top of LCI in order to validate our design and improve their performance. In particular, we plan on a collaboration with Ivo Kabadshow from Jülich in order to port his FMM library to LCI.
Many of today's HPC schedulers rely mostly on precise information about application runtimes (reservation-based scheduling). New applications (Big Data / ML) come with stochastic execution profiles, so scheduling as usual is no longer an option: the scheduling paradigm needs to change. Speculative scheduling provides reservation strategies for jobs whose execution time follows a probability distribution. We have shown that, with simple implementations, it brings large benefits for both (i) platform utilization and (ii) average job response time.
- understanding the type/behavior of new applications coming to HPC systems
- understanding the needs both from a sysadmin perspective, and from a user perspective of future batch schedulers
Update on the SVE activities done in the Mont-Blanc 2020 project
I would like to collaborate on benchmarking HPC applications on the first SVE processor (i.e. Fujitsu's post-K)