The workshop gathers leading researchers in high-performance computing from the JLESC partners INRIA, the University of Illinois, Argonne National Laboratory, Barcelona Supercomputing Center, Jülich Supercomputing Centre, RIKEN R-CCS and The University of Tennessee to explore the most recent and critical issues in advancing the field of HPC from petascale to the extreme scale era.
The workshop will feature sessions on these eight central topics:
In addition to these tracks, dedicated sessions targeting more specialized scientific domains are planned. The target domains change for each meeting depending on the needs and interests of the JLESC community. For this meeting the target domains are computational fluid dynamics, computational biology, and climate/weather research.
A key objective of the workshop is to identify new research collaborations and establish a roadmap for their implementation.
The workshop is open to all participants from the JLESC institutions (Illinois, INRIA, ANL, BSC, JSC, RIKEN R-CCS and UTK): faculty, researchers, engineers and students who want to learn more about Post-Petascale / Pre-Exascale Computing.
Ruth Schöbel is working at Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany. She studied Mathematics at RWTH Aachen University and is also trained as a mathematical-technical software developer. During her PhD (2016-2021) she collaborated with the Technical University of Dresden, Germany, in the field of parallel-in-time integration methods for partial differential equations. Currently she is involved in a project that uses reinforcement learning enabled by differentiable programming to optimize numerical preconditioning. She specializes in numerical mathematics and has developed a particular interest in semi-implicit time-stepping methods for differential equations. She has been an active JLESC member since 2017 and leader of the Early Career Committee since 2019.
Alan Ayala received an M.S. degree in Applied Mathematics from Université Pierre et Marie Curie and a PhD from Sorbonne Université and Inria Paris. He is a research scientist at the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville. Currently, Dr. Ayala's research focuses on the development of the heFFTe library for FFT computation on upcoming exascale systems, and on a benchmarking harness for state-of-the-art parallel FFT libraries.
George Bosilca is a Research Assistant Professor and Adjunct Assistant Professor at the Innovative Computing Laboratory at The University of Tennessee, Knoxville. His research interests revolve around distributed computing concepts and designing support for parallel applications to maximize their efficiency, scalability, heterogeneity and resiliency at any scale and in any setting. He is actively involved in projects such as Open MPI, ULFM, PaRSEC, DPLASMA and TESSE.
Robert Underwood is graduating with his PhD from Clemson University in South Carolina, USA. During his PhD program, Robert focused on making lossy compression more approachable through scientific and engineering contributions. He developed methods to determine configurations of lossy compressors that satisfy users' analysis constraints, to parallelize compression routines, and to quantify the effects of lossy compression of ML and AI training data on results. He has also made engineering contributions through the LibPressio library, which provides a consistent interface across more than 29 lossless and lossy compressors, and through improvements to the SZ compressor framework. His current work focuses on using lossy compression to empower AI-driven scientific workflows. He has been an active member of JLESC since 2020.
Daichi Mukunoki has been a research scientist in the Large-scale Parallel Numerical Computing Technology Research Team at the RIKEN Center for Computational Science (R-CCS), Japan, since 2019. He received his Ph.D. in computer science in 2013 from the University of Tsukuba. He was a research fellow of the Japan Society for the Promotion of Science in 2013-2014, a postdoctoral researcher at the RIKEN Advanced Institute for Computational Science (now R-CCS) in 2014-2017, and a postdoctoral research fellow at Tokyo Woman's Christian University in 2017-2019. His research interests include performance optimization on many-core and distributed parallel architectures, mixed-precision computation, and accurate and reproducible computation methods.
Philippe Swartvagher is a PhD student at INRIA Bordeaux Sud-Ouest, France, under the supervision of Alexandre Denis and Emmanuel Jeannot. His work focuses on the interactions between task-based runtime systems and communication libraries, to improve collaboration between these two layers.
Daniel is a PhD student in Data Science and Engineering and does research with the Innovative Computing Lab's Performance Tools group. He received his Bachelor of Science in Computer Engineering at the University of Tennessee. Daniel's research interests include application performance monitoring, benchmarking applications, and large-scale data analytics.
Kevin Sala is a Ph.D. student and Junior Research Engineer in the Programming Models group at the Barcelona Supercomputing Center (BSC). Passionate about parallel programming and high-performance computing (HPC), he focuses his research on exploiting the synergies between distributed-memory and shared-memory programming models. He is actively working on the development of the OmpSs-2 programming model and the Task-Aware MPI (TAMPI) library.
Julian Clausnitzer received his M.Sc. degree in mathematics from the University of Göttingen and is currently a PhD student at Jülich Supercomputing Centre, Forschungszentrum Jülich.
His current research project involves solving stochastic partial differential equations on two-dimensional domains using an explicit Euler-like time-stepping scheme combined with a Galerkin approach and a boundary integral approach.
Joseph Schuchart is working at the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville. He studied Computer Science at TU Dresden and received his PhD from the University of Stuttgart, Germany, in 2020, where he worked on task-based runtime systems in the partitioned global address space. Joseph is currently working on Templated Task Graphs, a novel abstraction of compressed task graphs in C++, as well as on various research topics related to MPI.
| Time (ET) | Track 1 | Track 2 | BOS |
|---|---|---|---|
| 08:00 | Opening Remarks (Location: Plenary) | | |
| 08:15 | ST M1.1 (8) Performance Tools (Location: Breakout Room Track 1) | ST M1.2 (8) Workflows/Execution/Resources/Facilities (Location: Breakout Room Track 2) | BOS: Heterogeneous and reconfigurable architectures for the future of computing (Session chair: Kazutomo Yoshii; Location: Breakout Room BOS) |
| 10:15 | Break (back on the main track on Zoom) | | |
| 10:30 | ST M1.3 Speed Networking (Session chair: George Bosilca; Location: Plenary) | | |
| 11:30 | Meeting on Zoom | | |
| Time (ET) | Track 1 | Track 2 | BOS |
|---|---|---|---|
| 08:00 | ST M2.1 (6) Resilience (Location: Breakout Room Track 1) | ST M2.2 (6) Advanced Architectures & Portability (Location: Breakout Room Track 2) | BOS: In Situ Processing at Large (Session chair: Bruno Raffin; Location: Breakout Room BOS) |
| 09:30 | Break (back on the Plenary track on Zoom) | | |
| 11:15 | Meeting on Zoom | | |
| Time (ET) | Track 1 | Track 2 | Track 3 | BOS |
|---|---|---|---|---|
| 08:00 | ST M3.1 (5) I/O, storage and in-situ processing (Location: Breakout Room Track 1) | ST M3.2 (6) Applications (Location: Breakout Room Track 2) | | BOS: Arm Architecture for HPC (Session chair: Mitsuhisa Sato; Location: Breakout Room BOS) |
| 09:30 | Break (back on the Plenary track on Zoom) | | | |
| 09:45 | ST M3.3 (6) AI/ML (Location: Breakout Room Track 1) | ST M3.4 (6) Numerics (Location: Breakout Room Track 2) | ST M3.5 (5) Compression (Location: Breakout Room Track 3) | |
| 11:15 | Closing (Location: Plenary) | | | |
This talk looks at proposals meant to improve both the usability and the efficiency of the MPI RMA interface. The proposals include improved dynamic window memory handling, atomic operation efficiency, and multi-threaded passive-target synchronization.
We are specifically looking for feedback from developers who are using, or have previously tried to use, MPI RMA, and for input on what applications or higher-level runtime systems expect and require from the interface, as we attempt to overhaul the API for the 5.0 release of the MPI standard.
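For context, a minimal sketch of today's passive-target RMA usage, one of the areas the proposals target, is shown below (written with mpi4py purely for illustration; it shows the current interface, not the proposed changes):

```python
# Sketch of current MPI RMA passive-target synchronization (via mpi4py),
# the kind of pattern the interface proposals aim to make easier and faster.
# Run with at least 2 ranks, e.g.: mpirun -n 2 python rma_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.zeros(4, dtype='d')            # memory exposed by every rank
win = MPI.Win.Create(local, comm=comm)    # collective window creation

if rank == 1:
    data = np.full(4, 42.0)
    # Passive-target epoch: the target (rank 0) does not participate explicitly.
    win.Lock(0, lock_type=MPI.LOCK_EXCLUSIVE)
    win.Put([data, MPI.DOUBLE], target_rank=0)
    win.Unlock(0)

comm.Barrier()                            # order the update before reading
if rank == 0:
    win.Lock(0, lock_type=MPI.LOCK_SHARED)   # synchronize window copies
    print("rank 0 window contents:", local)
    win.Unlock(0)
win.Free()
```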
In this presentation, I will introduce TTG, a novel programming model and its C++ implementation that, by marrying the ideas of control- and data-flow graph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared-memory parallel environments; a few support distributed-memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. I will demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently.
TTG aims at relieving the difficulty of programming data-dependent applications or applications with irregular and imbalanced memory access. New applications could challenge the programming model and allow us to extend it to support a larger spectrum of cases.
In this talk we will show the use of CUDA graphs to accelerate OmpSs benchmarks. With OmpSs, a set of CUDA tasks can be recorded once within the CUDA runtime system and replayed many times, thus reducing the overhead introduced by GPU management. To achieve this, we use the high-level CUDA graphs API.
We are open to collaborations in this field, for example on new ideas on how to use the CUDA graphs API, the development of additional benchmarks or applications with OmpSs, etc.
Finding the set of parameters that gives the best performance for a given application is very challenging. The exploration of the space spanned by the parameters to find the best set is combinatorial and often requires an unrealistic budget. To address this, auto-tuning techniques rely on Bayesian learning to build a model of performance.
In this talk we present an auto-tuning code based on deep Gaussian Processes.
We show how the multi-task and multi-fidelity approaches can be mixed with the transfer learning technique to improve the method. Our experiments on the SLATE library demonstrate how with a limited budget, i.e., a fixed number of runs, we are able to obtain more accurate information than state-of-the-art codes.
We believe this auto-tuning code is a great opportunity to improve the performance of existing applications.
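As a simplified illustration of the Bayesian auto-tuning loop, the sketch below uses a plain single-fidelity Gaussian Process surrogate (scikit-learn) with a synthetic benchmark; it is only meant to convey the idea and does not reflect the deep GP, multi-fidelity or transfer-learning machinery of the actual code:

```python
# Minimal Bayesian auto-tuning loop with a single-fidelity GP surrogate.
# run_benchmark() is a synthetic placeholder for a timed run of the tuned code.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def run_benchmark(cfg):
    return float(np.sum((cfg - 0.3) ** 2))        # hypothetical runtime model

rng = np.random.default_rng(0)
space = rng.uniform(0.0, 1.0, size=(256, 3))      # candidate configurations
X = space[rng.choice(len(space), 8, replace=False)]  # initial random samples
y = np.array([run_benchmark(c) for c in X])

for _ in range(20):                               # fixed evaluation budget
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(space, return_std=True)
    score = mu - 1.96 * sigma                     # lower confidence bound
    cand = space[np.argmin(score)]                # balance exploration/exploitation
    X = np.vstack([X, cand])
    y = np.append(y, run_benchmark(cand))

print("best configuration:", X[np.argmin(y)], "with runtime:", y.min())
```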
One of the most important aspects of application performance is the data traffic between the processing cores and main memory. However, since the related performance event counters are not core-private, applications require elevated privileges to access them. PAPI offers a component that can access this information on IBM systems through the Performance Co-Pilot (PCP), but doing so adds an indirection layer that involves querying the PCP daemon. Using benchmarks such as DGEMM and STREAM, we study the accuracy of the measurements obtained through the PCP component on the Summit supercomputer.
Open Questions: Other than memory traffic events, what categories of "nest" (IBM equivalent of "uncore") events would be insightful to measure? Which scientific applications could benefit from measuring these events?
This presentation will introduce the DFG/RFBR project ExtraNoise (see https://www.vi-hps.org/projects/extranoise/overview/overview.html). Its goal is to develop performance measurement, analysis and modelling methods which are resilient to system noise. A second goal is to better understand the noise sensitivity of current HPC systems, applications and algorithms.
We would like to connect with other projects and groups which are working on system noise related issues.
The C Configuration Space and Tuning Library (CCS) is a framework created to bridge the gap between auto-tuning frameworks and applications with auto-tuning needs. CCS offers a C API to describe rich configuration spaces and objectives, as well as the auto-tuners to optimize those problems. CCS was conceived with performance in mind, in order to be integrated in and driven by low-level runtimes and applications (JIT, HPC runtimes, etc.). CCS also comes with a set of Python and Ruby bindings that allow prototyping auto-tuners in high-level languages and invoking them from the low-level applications.
This presentation will introduce the features of CCS and describe the co-design work that happened with the Kokkos Tuning API.
Collaboration opportunities:
- teams working with runtime systems and with hyperparameters to optimize could be interested in interfacing with CCS
- teams working on auto-tuning strategies could be interested in steering CCS in a direction that allows describing their problems and auto-tuning strategies
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding and optimizing end-to-end performance of applications in such a complex continuum is challenging. It breaks down to reconciling many, typically contradictory, application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum.
In this talk, I will introduce a rigorous methodology for such a process and its implementation in E2Clab, a framework supporting the reproducible and complete experimental cycle across the Edge-to-Cloud Continuum: deployment, monitoring, analysis, optimization. Next, I will illustrate our methodology by optimizing Pl@ntNet, a world-wide plant identification application.
Finally, I will briefly highlight ongoing work on adding support in E2Clab for the small wireless sensor edge devices provided by the very-large-scale FIT IoT-LAB testbed. Furthermore, I will present the research challenges in enabling provenance capture in Edge-to-Cloud experiments (How do existing provenance capture systems perform in resource-constrained environments? How can provenance be captured efficiently in this context?). The ultimate goal is that the captured data could support computational reproducibility and allow users to better understand how the experiment results have been produced.
How do the existing provenance capture systems perform in resource-constrained environments?
How to efficiently capture provenance in Edge-to-Cloud experiments? How have the experiment results been produced (e.g., hardware, software, and application configurations used)?
Potential collaborators: researchers aiming to understand and optimize the performance of their real-life use cases on large scale infrastructures in a systematic and reproducible approach. Do you have any use cases to share?
In this work, we study the problem of adapting a geostatistics application, ExaGeoStat, online to the best set of available heterogeneous nodes per application phase. ExaGeoStat is an iterative application that computes the maximum likelihood estimate (MLE) for spatial data. We use a Gaussian Process (GP) to model the expected performance of a group of system-level heterogeneous nodes per application iteration. We show that by correctly modeling the GP trend, limiting the search space with a linear program prediction, and considering the heterogeneity with dummy variables, we can quickly adapt at runtime and reduce the total number of computational nodes used.
The model is tailored for this application. We are interested in other task-based applications to evaluate the generality of our approach.
This presentation will cover EQ-SQL, a database-oriented backend for the EMEWS framework. This project is a derivative and generalization of components developed for COVID-19 modeling workflows built in 2020.
This topic touches on workflow interoperability, which may be of interest to others in JLESC.
In this short talk we will present an ongoing initiative at INRIA to build a unified portfolio of HPC codes developed at INRIA, adopting good practices for improving productivity, sustainability and reproducibility, partly building on the E4S initiative.
We expect feedback from other partners involved in similar experiences, in particular about the E4S initiative.
Joint work by Sophie Cerf (1), Raphaël Bleuse (1), Valentin Reis (2), Swann Perarnau (2) and Eric Rutten (1) (1: Inria, 2: ANL), Euro-Par 2021.
Production high-performance computing systems continue to grow in complexity and size. As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) for the whole platform becomes a challenge. Alongside the growing complexity of scientific workloads, this extreme heterogeneity is also an opportunity: as applications dynamically undergo variations in workload, due to phases or data/compute movement between devices, one can dynamically adjust power across compute elements to save energy without impacting performance. With an aim toward an autonomous and dynamic power management strategy for current and future HPC architectures, this paper explores the use of control theory for the design of a dynamic power regulation method. Structured as a feedback loop, our approach, which is novel in computing resource management, consists of periodically monitoring application progress and choosing at runtime a suitable power cap for processors. Thanks to a preliminary offline identification process, we derive a model of the dynamics of the system and a proportional-integral (PI) controller. We evaluate our approach on top of an existing resource management framework, the Argo Node Resource Manager, deployed on several clusters of Grid'5000, using a standard memory-bound HPC benchmark.
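The feedback loop itself can be pictured with the toy sketch below: a discrete-time PI controller that periodically reads a progress sensor and adjusts a power cap. All names, gains and the simulated node model are illustrative placeholders, not the Argo NRM implementation:

```python
# Toy PI feedback loop: measure application progress, compare to a setpoint,
# and adjust the processor power cap. The simulated sensor/actuator and the
# gain values are placeholders, not the actual NRM interfaces.
import time

KP, KI = 0.5, 0.1                    # gains, would come from offline identification
PCAP_MIN, PCAP_MAX = 60.0, 150.0     # hypothetical power-cap range (watts)

class SimulatedNode:
    """Stand-in for the real sensor/actuator pair."""
    def __init__(self):
        self.pcap = PCAP_MAX
    def apply_power_cap(self, watts):
        self.pcap = watts
    def read_progress(self):
        # crude model: progress saturates as the power cap increases
        return 10.0 * (1.0 - 2.0 ** (-self.pcap / 50.0))

def control_loop(node, setpoint, period=0.01, steps=200):
    integral, pcap = 0.0, PCAP_MAX
    for _ in range(steps):
        error = setpoint - node.read_progress()   # positive: application too slow
        integral += error * period
        pcap += KP * error + KI * integral        # PI correction on the power cap
        pcap = min(max(pcap, PCAP_MIN), PCAP_MAX)
        node.apply_power_cap(pcap)
        time.sleep(period)
    return pcap

print("settled power cap:", control_loop(SimulatedNode(), setpoint=8.0))
```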
dynamic resource & jobs runtime management
Autonomic Computing, Feedback Loops, and Control Theory
We present OACIS, a job management software with which you can manage jobs, conduct parameter scans, and inspect the results visually via an interactive interface. In my talk, I will present a demo of the software.
collaboration with other workflow systems
In recent years, power consumption in HPC data centers has reached the tens-of-megawatts scale due to the large scale of computing facilities. Most of the electricity consumed by IT equipment is converted into heat, which is finally exhausted by a cooling system consisting of mechanical devices with various technical constraints. Prior to the launch of the official service of Fugaku, we conducted benchmark campaigns to evaluate its overall performance. In this study, we analyze cooling facility operations in detail using operational metrics collected during the evaluation period.
We are working to improve operations through an operational data analysis approach and are looking for opportunities to accelerate the project through collaborations. We welcome any and all ideas and attempts.
High Performance Computing systems are approaching exascale operations per second (10^18 FLOPS), significantly increasing in size and complexity. These developments also expand the variety and frequency of possible problems, demanding continuous monitoring and analysis of the system and its usage during regular production. With this intent, the Jülich Supercomputing Centre has been developing the LLview monitoring infrastructure, combining monitoring metrics from CPU, GPU, I/O and interconnect activities. With negligible overhead, LLview provides individualised fast access to job reports for users, mentors, support and administrators alike, which are then used to manage and support production runs. This presentation provides an overview of LLview, a few success cases, and perspectives for future enhancements and collaborations.
Exchange experience in analysing and presenting report data, and in how to use monitoring in daily HPC support activities. Discuss similar activities among centers; get ideas on how to improve currently available sources, metrics and presentation; discuss statistical analysis of collected performance data.
I will describe our approaches to running programs with power consumption
We will share our experience in operating large-scale facilities and aim at power-saving operation by utilizing hardware and facility functions.
For a single task, the Young/Daly formula provides the optimal checkpointing period to minimize the expectation of the total execution time. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. This work studies the impact of checkpointing more often than prescribed by the Young/Daly formula, on both the theoretical and the practical side.
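For reference, with checkpoint cost C and platform MTBF μ, the first-order Young/Daly period is

```latex
W_{\mathrm{YD}} = \sqrt{2\,\mu\,C}
```

so checkpointing more often than Young/Daly means using a period W < W_YD.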
- Is there a realistic workflow benchmark for parallel (rigid or moldable) tasks available for simulation? How can we find realistic instances combined with their failure laws? Is there any JLESC partner interested in collaborating on this topic?
- In this work we focused on finding a good checkpointing strategy for a fixed given schedule, but is it possible to design a global algorithm that schedules, maps and checkpoints tasks all at once?
We study the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time or the makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Our work generalizes the classical scheduling framework for failure-free jobs. To cope with silent errors, we introduce two resilient scheduling algorithms and show the approximation ratios for the two algorithms under several prominent speedup models (e.g., roofline, communication, Amdahl, power, monotonic, and a mix model). An extensive set of simulations is conducted to evaluate different variants of the two algorithms, and the results show that they consistently outperform some baseline heuristics.
While this work considers independent parallel jobs, one open question is to study jobs with dependencies (i.e., workflows) and to design efficient algorithms to cope with job failures in workflows. Another open question is to consider fail-stop errors and design effective strategies that combine scheduling and checkpointing. Collaboration opportunities exist amongst researchers working on resilience, system-level failure recovery, and workflow scheduling.
Portability and resilience are growing problems in high performance computing with different requirements that do not always get along. More systems are relying on accelerators for computational power while memory hierarchies expand to feed said accelerators. As supercomputers increase in scale and heterogeneity, efficiently leveraging the hardware becomes more difficult and the mean time between failures decreases. Portability abstractions such as Kokkos address the heterogeneity problem while checkpoint/restart systems such as VeloC provide resilience against system failures. We present a proof of concept that augments Kokkos applications using VeloC by enabling access pattern aware checkpointing (i.e., by leveraging application and system awareness to create efficient portable checkpoints).
The portability and checkpoint challenges present an interesting opportunity to leverage information from portability abstractions to augment checkpoint runtimes. Kokkos data abstractions contain information on how data is accessed. For example, whether concurrent accesses with many threads are done through atomics or data replication. These abstractions also simplify detecting more complex access patterns, such as sparse updates. The Kokkos Resilience project combines Kokkos abstractions with state-of-the-art checkpoint runtimes to provide checkpoint capabilities to Kokkos applications while remaining portable and minimizing constraints imposed on the developer. Kokkos Resilience uses VeloC, a scalable asynchronous checkpoint runtime, as the back-end.
We present the groundwork for integrating access pattern aware strategies into Kokkos Resilience abstractions and VeloC. Exploiting this opportunity will allow for more intricate checkpoint optimizations without the complexity normally required to implement and use them. These optimizations would also be portable and accessible by any Kokkos application. Specifically, we demonstrate the effectiveness of access pattern aware checkpointing by leveraging incremental checkpointing for sparsely updated data. We show how this approach allows for smaller and faster checkpoints while reducing complexity to users.
This project is collaborative research between ANL, SNL, University of North Texas, and GCLab
Incremental checkpoints are useful for applications with infrequent updates such as graph applications. We want to expand Kokkos Resilience and VeloC to support other access patterns seen in applications. What other access patterns can we use to augment checkpoint behavior?
Not all data needs to be saved in a checkpoint. Manually filtering out unnecessary data is more efficient but requires more effort from the developers. How can we use application access patterns to automatically determine whether or not data needs to be checkpointed?
VeloC provides a simple but powerful API that makes adding checkpointing capabilities to applications easy. Kokkos Resilience aims to further automate the process for Kokkos applications. What improvements can we make to the Kokkos Resilience interface to help users control checkpoint behavior in an application agnostic fashion?
We have shown that information about the frequency of updates to data can be gathered from Kokkos and used to enhance checkpoint performance. What other information or statistics can we gather to tune VeloC performance?
This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors.
- Identify JLESC applications that are iterative and determine their task lengths and checkpoint costs
- Extend the study to fit the framework of some JLESC applications
- Maybe consider applications that are composed of repeated executions of the same DAG (not just a linear chain)?
Particle and ensemble Kalman filtering are standard techniques for ensemble data assimilation in weather and climate prediction. Data assimilation aims to combine observations and numerical simulation to achieve an accurate description of the system's state. Both techniques carry the state's probability distribution function within the ensemble of states. As a consequence, the ensemble needs to be sufficiently large. As climate models on their own often require a large amount of resources, running thousands of them quickly grows to a very large scale. It is still common to leverage the file system to communicate the ensemble states between the different software layers of the data assimilation framework. However, as the modeling space and resolution increase, this practice poses a bottleneck. On the other hand, in-memory implementations are hard to protect, for the very same reason that the I/O system poses a bottleneck. Our implementation uses the client-server structure of Melissa-DA and circulates particles in the background through the parallel file system, using node-local worker processes. A combination of dynamic scheduling, prefetching and a distributed cache layer on the compute nodes allows parallel efficiencies of over 90% even at very large scales.
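As textbook background for the filtering step itself (independent of the Melissa-DA infrastructure and of any specific model), one sequential importance resampling update over the ensemble of particles can be sketched as:

```python
# Generic particle-filter (sequential importance resampling) analysis step.
# Textbook background only, not the Melissa-DA implementation.
import numpy as np

def pf_update(particles, observation, obs_operator, obs_std, rng):
    """particles: (N, state_dim) ensemble carrying the state's distribution."""
    predicted = obs_operator(particles)                        # states -> observation space
    residual = observation - predicted                         # innovation per particle
    loglik = -0.5 * np.sum((residual / obs_std) ** 2, axis=1)  # Gaussian observation error
    w = np.exp(loglik - loglik.max())
    w /= w.sum()                                               # normalized importance weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]                                      # resampled ensemble

rng = np.random.default_rng(0)
ens = rng.normal(size=(1000, 3))         # toy ensemble of 1000 particles
obs = np.array([0.5, -0.2])              # toy observation
H = lambda x: x[:, :2]                   # toy observation operator
analysis = pf_update(ens, obs, H, obs_std=0.3, rng=rng)
print("analysis mean:", analysis.mean(axis=0))
```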
The largest execution we performed has been on Fugaku (RIKEN) on 2048 nodes with 32K particles. At that scale, and with a state size of 512 MB, we potentially write 1 TB of data concurrently. We have not yet reached a sufficient scale to really saturate the I/O throughput of the parallel file system. Initially, we used the network to request states from other runners instead of loading them from the parallel file system. However, at the operated scale, this approach is not faster. It is still an open question whether there is a configuration in which a peer-to-peer state distribution performs better.
We are working on storing the states using lossy methods that reduce the state size. For instance, lossy compression with ZFP or other interpolation methods. An open question is how the data assimilation responds to the additional model error imposed by those methods. The insight of climate researchers is very much appreciated for a discussion on this topic.
An always welcoming collaboration is to port weather and climate simulations to the framework.
Of particular interest in the context of HPC-AI convergence is the need to capture, persist, search for and reuse training/model data in a variety of scenarios: resilience, provenance tracking, reproducibility, workflows, algorithms that revisit previous results, etc. Unfortunately, traditional data storage on HPC platforms has seen a relatively static evolution: it is still centered around parallel file systems and key-value stores that offer only limited low-level capabilities, leaving a gap at application level that is currently filled with inefficient ad-hoc techniques. This talk introduces several efforts to address this gap, from checkpoint-restart techniques used in the context of resilience that can be extended as light-weight versioned caching layers up to novel data models centered around data states, which introduce a way to represent and reason about intermediate data as a first-class citizen.
How can we (re)use intermediate transformations of training data and/or DNN models to accelerate AI workflows (pre-processing, network architecture search, sensitivity analytics, etc)?
Hardware specialization is gaining more attention as transistor scaling comes to a halt. At the same time, open-source hardware ecosystems and tools are growing and becoming more practical. For example, the Chisel hardware construction language, a domain-specific language implemented as Scala class libraries, is a popular open-source hardware language originally designed to describe RISC-V and employed by numerous hardware projects. We have been using Chisel for designing custom hardware components and performing functional verification. In this short talk, I briefly introduce Scala and Chisel alongside design examples and discuss how we use Chisel-based hardware designs.
There are a couple of open questions related to this work. First, why and where is hardware specialization needed? What needs to be specialized? How should it be implemented? Is it feasible?
The need for computing in restricted or constrained environments (power, size, stall-free, real-time operation), where no general-purpose computer meets the requirements, is an opportunity for hardware specialization. An example is real-time computing at the edge of scientific instruments.
With an increasing workload diversity and hardware complexity in HPC, the boundaries of today's runtimes are pushed to their limits. This evolution needs to be matched by corresponding increases in the capabilities of system management solutions.
Power management is a key element in the upcoming exascale era: first, to stay within the power budget, but also to let applications make the most of the available power in order to make progress. Therefore, our objective is to balance complex application requirements while keeping power consumption under budget.
To achieve this goal, the Argo group is working on the Node Resource Manager (NRM) tool, which allows us to centralize node management activities such as resource and power management. The latter is achieved by getting information (monitoring) from various sensors (power, temperature, fan speed, frequency...) and adjusting actuators (CPU p-states, Intel RAPL) according to the application needs. The next step in our power management strategy is to improve NRM monitoring to more easily identify the location (within the topology) and scope (range of devices) that monitoring events are related to.
To evaluate our implementation, we are looking for JLESC members willing to extend this work with more complex applications with dynamic resource balancing problems, on which we first can observe such imbalance, and then address it with a better power management strategy relying on precise identification of the relation between the gathered monitoring events, the devices present, and the inner components of applications. We are aiming to get a better understanding of the behavior of such applications under various scenarios of power management, as well as studying the possibility of characterizing applications' power needs in order to develop an automated resource management policy.
waLBerla is a modern multiphysics simulation framework specifically designed to exploit the full power of the largest supercomputers for a wide class of scientific research questions. In the past, waLBerla has been run with near-perfect scalability on both the Piz Daint (2048 GPUs) and JUQUEEN (~458k cores) compute clusters.
Now, as part of the SCALABLE project funded by the EuroHPC JU, Jülich is working on porting waLBerla to ARM processors. This talk will focus on the main problems that need to be solved in order to port waLBerla to ARM processors. As part of the discussion and future work, we would like to investigate the possibilities of making such a port scale to large ARM machines.
Open Questions:
- Are the fundamental data structures and algorithm designs in waLBerla compatible with writing scalable, performant software for ARM systems? How might they be changed to make them work better for ARM (SVE, for example)?
- What are previous experiences of porting extremely scalable multiphysics simulation codes to ARM processors, and how might they apply to waLBerla?
- What are the available options for a supercomputing center to offer ARM expertise and resources in performing the port?
Collaboration Opportunities:
- Opportunity to take a code that is scalable on both GPUs and x64 CPUs, port it to ARM, and try to achieve similar scalability.
We present our experience with the programming language Julia for fast prototyping of novel GPU-focused algorithms that leverage the language's portability features. This allowed us to port complex constrained optimization algorithms to multiple architectures with very low development effort, together with advanced methods such as automatic differentiation.
We want to release our GPU portability efforts in Julia to the wider applied mathematics community and are looking for collaborations in the numerical optimization software field.
We are regulating the injection of bag-of-tasks (BoT) jobs with low priority into a computing grid in order to use idle resources. However, this injection can deteriorate the overall performance of the clusters (e.g. the performance of the parallel filesystem). Thus, we need to regulate the injection of these BoT jobs in order to guarantee a quality of service for the priority users of the grid. To do so, we are using control theory tools. This is joint work with Eric Rutten (INRIA) and Olivier Richard (UGA).
- Resource harvesting
- Dynamic resource & jobs management
- Autonomic Computing and Control Theory
We are performing performance and scaling measurements with the Earth system models ICON and TSMP on modular heterogeneous architectures (the JUWELS and DEEP systems at JSC). We will present first porting and scaling results for homogeneous versus heterogeneous runs, as well as performance analysis results.
Since our work is at an early stage, many questions remain open. For example, what kind of performance gain (if any) can we expect using ICON or TSMP on a modular system as opposed to a standard homogeneous system? Does the modular approach place any limitations on the frequency of data exchange between the two sub-models on the individual partitions, due to the increased latency between the individual hardware partitions?
We present the Feature Tracking Kit (FTK), a framework that simplifies, scales, and delivers various feature-tracking algorithms for scientific data. This talk will demonstrate technical highlights, use cases, and scalability studies of FTK through scientific applications including tokamak, fluid dynamics, and superconductivity simulations.
Seeking collaboration opportunities in the JLESC community on data analysis and visualization with advanced feature tracking capabilities.
The I/O bottleneck in next-generation computing systems constitutes a major challenge for the analysis of data-intensive high-performance computing (HPC) simulations. In situ processing, combined with in-memory computing, has emerged as a solution to overcome I/O bottlenecks in HPC systems, yet how to exploit these technologies efficiently from the infrastructure and human resources perspectives is still a challenging open problem.
This work targets one of the most common applications at petascale: molecular dynamics (MD) simulations studying the classical time evolution of a molecular system at atomic resolution. We address the challenge of bringing in situ analytics to MD by designing and developing A4MD: a workflow engine that features plug-and-play workflow composition of reusable building blocks; transparent in situ annotation capability for user-defined methods; simulator-agnostic runtime trajectory analysis; support for stopping, starting and restarting simulations; and ensemble workflow patterns optimized for HPC environments. A4MD empowers scientists to integrate simulation and analytics into complex workflows for runtime detection of changes in structural and temporal molecular properties. This knowledge of molecular structures' transformations at runtime can be used to steer simulations to more promising areas of the simulation space, identify the data that should be written to congested parallel file systems, and index generated data for retrieval and post-simulation analysis.
We demonstrate the capabilities of A4MD by studying a case of enhanced adaptive sampling for the exploration of the conformational space of the FS peptide protein. We model the execution of an ensemble of trajectories and analyse the overall throughput obtained by the workflow. Using A4MD, we integrate runtime trajectory analysis to automatically detect inefficiencies in the behaviour of the simulations.
Collaboration opportunities in topics of interest to JLESC include:
- Orchestration and placement of workflow components to maximize resource utilization. A4MD is able to terminate and restart simulations based on scientific criteria (e.g., a simulation is not reaching the desired state). As shown in the presentation, this procedure already improves the overall efficiency of the ensemble. However, we are currently not monitoring the simulations in terms of their resource utilization, and we are not making placement decisions to mitigate system inefficiencies.
- Annotation, indexing and storage of trajectory metadata. Although A4MD has the mechanisms in place to automatically generate simulation metadata, we are currently not managing the aspects of persistent storage that arise from annotation, indexing and querying of these products.
- Applications. We look for use cases that can leverage the wide range of functionalities in A4MD, especially in the area of enhanced adaptive sampling through MD ensembles.
A protein's structure determines its function. Different proteins have different structures; proteins in the same family share similar substructures and thus may share similar functions. Additionally, one protein may exhibit several structural states, also named conformations. Identifying different proteins and their conformations can help solve problems such as determining the cause of diseases and designing drugs. X-ray Free Electron Laser (XFEL) beams are used to create diffraction patterns (images) that can reveal protein structure and function. The translation from diffraction patterns in the XFEL images to protein structures and functionalities is nontrivial.
We present the current status of our framework XPSI (XFEL-based Protein Structure Identifier). XPSI combines DL (autoencoder) and ML (kNN) to capture key information that allows the identification of properties, such as spatial orientation, protein conformation, and now protein type from the diffraction patterns. In our previous talk, we explored complex protein diffraction imaging datasets: three imaging resolutions and two proteins with four conformations each. We added the prediction of the 3rd rotational angle for spatial orientation and enunciated the symmetry challenges for its prediction. For this edition, we propose a solution to overcome the symmetry challenges and obtain higher accuracy for the 3rd rotational angle prediction. Furthermore, we add the identification of multiple proteins, completing the properties required to identify the structure of proteins and do 3D reconstruction of the proteins. We quantify the classification accuracy and performance of XPSI for two protein types: EF2 and ribosome, with two conformations each, and two imaging resolutions. We obtain an orientation prediction error up to 10 degrees, a difference in the predicted third angle up to 5 degrees, conformation accuracy prediction of 92%, and 100% of protein type accuracy prediction.
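To make the two-stage idea concrete, the sketch below shows a generic latent-feature kNN classification step (scikit-learn, with a placeholder standing in for the trained autoencoder); it only illustrates the shape of the pipeline, not the actual XPSI models or data:

```python
# Generic stand-in for the XPSI classification stage: encode diffraction images
# into latent features (placeholder encoder) and classify them with kNN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def encode(images):
    """Placeholder for the trained autoencoder's encoder (latent features)."""
    return images.reshape(len(images), -1)[:, :32]   # hypothetical 32-d latent space

rng = np.random.default_rng(0)
train_imgs = rng.normal(size=(200, 64, 64))          # toy diffraction patterns
train_labels = rng.integers(0, 2, size=200)          # e.g., protein type labels
test_imgs = rng.normal(size=(10, 64, 64))

knn = KNeighborsClassifier(n_neighbors=5).fit(encode(train_imgs), train_labels)
print(knn.predict(encode(test_imgs)))
```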
As next steps, we are working on 1) aspects of shareability and portability through a Jupyter notebook on platforms like Wasabi or Google Colab; 2) testing the framework with more data (i.e. three or more protein types with two or more conformations each); and 3) comparing our framework with a traditional XFEL slice-matching approach, which is used for 3D reconstruction and follows the premises of orientation prediction through multiple and time-consuming iterations.
This project is collaborative research between RIKEN, GCLab, and ICL.
What is the suitability of the current framework for real-world cases (e.g., 3-D reconstruction of protein in general and ribosome in particular, annotation of images for their classification without human intervention)?
What is the range of errors (i.e., distribution of orientation and/or conformation) tolerable for scientists?
What are the costs in terms of computational requirements (execution time and resources) for other methods that identify proteins' properties used by the scientists, and how does our framework compare?
Can we use systematic data from simulations to train the framework and use the trained models for classification/prediction of real experiment data successfully? What aspects from the science, such as, rotational orientation, beam resolution, symmetrical proteins, can presumably limit the functionality of the framework?
Scientific workloads executed in High Performance Computing (HPC) environments are becoming increasingly complex while routinely producing terabytes to petabytes of intermediary or long-term data products. Optimizing Input/Output (I/O) to absorb these data near the optimal performance offered by the underlying storage systems is a major challenge for science users. In particular, many established performance tools currently do not allow taking a holistic perspective on workflow I/O because the fusion of workflow and performance artifacts is complicated. For example, task-data dependencies are often only known to the workflow engine. Performance counters are usually captured through system APIs or third-party instrumentation, but observed performance has to be evaluated with respect to the execution context, which should also include topological details and capacity constraints of the allocation in which workflow tasks or applications are executed.
In this work we present the I/O Behavior Analysis Toolkit (IOBAT), that was specifically developed for the fusion and interactive exploration of workflow-, system- and performance artifacts. We use IOBAT as a prototype framework to inform the requirements for both programmatic and interactive analysis of workflow I/O behavior. IOBAT deliberately opts to develop reusable analysis and visualization components that can be used in Jupyter Notebooks as well as in custom application or web-based dashboards.
This project is a collaboration with ANL as well as with the University of Hamburg and the University of Magdeburg.
While visualizing even large amounts of artifacts is possible, the presentations can quickly degrade in usefulness without good strategies to filter, reduce, or cluster information. Many of these as well as aspects of level of detail can be automated and augmented using techniques from machine learning.
We are seeking partners to instrument workflows (both with and without workflow engines) to analyse and inform additional tooling and post-processing before visualization.
Collaboration on developing guidance for introspection APIs and standard telemetry to capture by middleware, workflow engines, and system services to help reduce gaps when stitching together workflow artifacts.
Over the past six years, JSC has been part of the Sage and Sage2 EU projects. These focused on first-hand experience with CORTX, a newly developed object store from Seagate. For that purpose we have administered the main prototype in Jülich, ported some use cases and developed a set of middleware. The talk will focus on the main issues faced when accommodating an object store, including challenges with moving use cases over from POSIX file systems and the middleware required to facilitate the porting efforts.
As part of the discussion and future work, we would like to investigate how object stores are moving on from research towards becoming part of a supercomputing centre's main offered storage infrastructure.
- How are applications preparing and porting to object stores?
- What are the experiences with deployments of various large installations of object stores (ex. DAOS)?
- What are the available options for a Supercomputing centre to offer both POSIX and object storage interchangeably?
- Collaboration opportunities on middleware developed to ease access and porting of applications to object stores:
Here we consider that some of the principles are transferable between the various available implementations of object stores, leading to similar problems being faced by all those who adopt them. A collaboration could identify similarities and attempt to create more generic tools, for example an ingestion tool allowing user-configured, formatted data transfer from a POSIX file system to an object store.
Data assimilation combines observational information with numerical models to provide an improved estimate of the state of interest. In the field of numerical weather prediction, data assimilation plays an essential role in the advancement of weather forecasting via the assimilation of observations from various platforms. A global data assimilation system that couples the Non-hydrostatic Icosahedral Atmospheric Model (NICAM) with the Local Ensemble Transform Kalman Filter (LETKF), namely NICAM-LETKF, was first developed in 2009. Since then, NICAM-LETKF has been extended to assimilate realistic observations from conventional and satellite platforms. To increase its computational efficiency on supercomputers, the code structure and data throughput of NICAM-LETKF were re-designed for the K computer and further optimized for the Fugaku supercomputer. NICAM-LETKF consists of three major steps: 1) the observation operator, which transforms the model state to observation space, 2) the LETKF analysis, which takes the observation data and ensemble forecast data to perform data assimilation, and 3) the NICAM ensemble forecast, which integrates the updated ensemble state forward to the next analysis time. During the LETKF analysis step, matrix multiplication and eigenvalue decomposition are employed to solve the analysis equation in the ensemble space for each grid point in parallel. The output from the LETKF analysis includes the mean state and the individual ensemble members on the original icosahedral grid, stored in separate tiles (small groups of grid points). In this presentation, we will provide an overview of the NICAM-LETKF system and use a new feature that assimilates data from a satellite-borne precipitation radar as an example to demonstrate the workflow and the computational efficiency of the system.
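For reference, the per-grid-point analysis solved in step 2 follows the standard LETKF formulation (e.g., Hunt et al., 2007): with ensemble size k, background ensemble perturbations in observation space Y, observation error covariance R, background mean observation ȳ^b and observation y^o (covariance inflation omitted here),

```latex
\tilde{P}^a = \left[(k-1)\,I + Y^{\mathsf T} R^{-1} Y\right]^{-1}, \qquad
\bar{w}^a   = \tilde{P}^a\, Y^{\mathsf T} R^{-1}\,\left(y^o - \bar{y}^b\right), \qquad
W^a         = \left[(k-1)\,\tilde{P}^a\right]^{1/2},
```

where the inverse and matrix square root are obtained from the eigenvalue decomposition mentioned above.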
How to further improve computational efficiency and I/O for large ensemble data assimilation?
Quantum chemistry software comprises immensely useful tools in materials and biological science research, but only a few programs have been developed in Japan. The mission of our research team is to provide high-performance software for molecular electronic structure calculations, and we have been developing the NTChem package for general-purpose and massively parallel computation on supercomputers including Fugaku. It implements not only standard quantum chemistry approaches, but also our original and improved theoretical methods. The package has been successfully ported to Fugaku, the fastest supercomputer in the world, where it demonstrates the capability of density functional theory (DFT) calculations on systems with more than 50,000 atomic orbitals.
NTChem is expected to be a useful tool in various computational studies for large and complicated molecular systems.
Currently NTChem2013 v12.1 is available on Fugaku, and we are open for collaboration.
Quantum gates are simulated with a numerically exact method using the supercomputer Fugaku. A novel bit-swap algorithm for distributed computation is proposed in order to use as much of the available memory as possible. Using this algorithm, it is possible to remove the restriction of the previous bit-swap algorithm that the number of MPI processes be a power of two.
How can we further reduce memory usage (e.g., by data compression or by using storage)? How can we make the MPI communication in our algorithm more efficient?
This talk will focus on the in-house CFD solver CUBE (Complex Unified Building cubE) developed at R-CCS. CUBE is a unified solver framework capable of tackling various multiphysics applications such as fluid flow, solid mechanics, fluid-structure interaction, combustion, droplet dynamics, etc. A brief overview of the background and structure of the solver, and of the various applications for which it is used, will be presented.
Dynamic load balancing, AMR and machine learning are areas in which we are currently interested in expanding the capabilities of CUBE.
Molecular dynamics (MD) simulations have been widely used to explore biological phenomena that are difficult to study with experiments. Since all-atom MD simulations of large biomolecular complexes are computationally expensive, coarse-grained (CG) models based on different approximations have been developed. Here, we introduce an integrated implementation of several popular CG models in the MD program GENESIS. We developed a user-friendly toolbox to generate input files of residue-level CG models containing folded and disordered proteins, RNAs, and DNAs using a unified format and optimize the performance of CG MD simulations via efficient parallelization in GENESIS software. Our implementation will serve as a framework to develop novel CG models and investigate various biological phenomena in the cell.
We would like to explore the extendability of our program, to include more CG models for different biomolecules, and to integrate with force fields of different resolutions. Besides, it's also important to consider the compatibility of our program with the modeling tools.
We introduce new parallel computing algorithms in the GENESIS software to perform cellular-scale molecular dynamics (MD) simulations on Fugaku. They include (1) a new algorithm for real-space non-bonded interactions, (2) reciprocal-space non-bonded interactions minimizing communication cost, (3) accurate temperature/pressure evaluations that allow a large time step, and (4) effective parallel file input/output (I/O) for MD simulations of extremely large systems. Based on these developments, GENESIS can perform MD simulations of a system containing 1.6 billion atoms with a performance of 8.30 ns/day. This will extend the accessible size and time of MD simulations to answer unresolved questions about biomolecules in a living cell.
I'm interested in effective file I/O of a very large system.
As compute node complexity in high-performance computing keeps increasing, systems equipped with heterogeneous memory devices are becoming prevalent. Efficiently utilizing heterogeneous memory based systems, however, poses significant challenges to application developers.
System software level transparent solutions utilizing artificial intelligence and machine learning based approaches, in particular non-supervised learning based methods such as reinforcement learning, may come to the rescue.
However, such methods require rapid estimation of execution runtime as a function of the data layout across memory devices in order to explore different data placement strategies in an iterative fashion, rendering architecture-level simulators (e.g., gem5, Ramulator, etc.) impractical for this purpose.
We investigate a differential-tracing-based approach where we obtain memory access traces using high-frequency sampling (e.g., Intel's PEBS) on real hardware running out of different memory devices. We develop a runtime estimator based on such traces that provides a runtime estimate orders of magnitude faster than full-system simulators.
This talk introduces the approach we took, provides preliminary results on estimation accuracy and discusses future directions for its application to machine learning based heterogeneous memory management.
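A minimal sketch of the estimation idea is given below, assuming (purely for illustration) that sampled access counts per data object can be combined linearly with per-device access costs calibrated from runs on real hardware; the names and the linear cost model are assumptions, not the actual estimator:

```python
# Toy trace-based runtime estimator: combine sampled access counts per data
# object with per-device access costs calibrated on real hardware.

DEVICE_COST = {"DRAM": 1.0, "HBM": 0.4, "NVM": 3.5}   # calibrated cost per sampled access

def estimate_runtime(access_counts, placement, compute_time):
    """access_counts: sampled accesses per data object (e.g., from PEBS samples).
    placement: data object -> memory device.  compute_time: non-memory part."""
    mem_time = sum(access_counts[obj] * DEVICE_COST[placement[obj]]
                   for obj in access_counts)
    return compute_time + mem_time

counts = {"A": 1_000_000, "B": 250_000, "C": 4_000_000}
layout1 = {"A": "HBM", "B": "DRAM", "C": "DRAM"}
layout2 = {"A": "DRAM", "B": "NVM", "C": "HBM"}
for layout in (layout1, layout2):
    print(layout, "->", estimate_runtime(counts, layout, compute_time=2e6))
```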
- Methods to enhance accuracy.
- Integration with machine learning.
- Evaluation on different platforms with heterogeneous memory devices.
I am employing modern game engines to facilitate high-quality visualizations of generated plant models. My work includes the generation of ground truth data for machine learning algorithms and inspecting its use and training behavior. For this, I need high-quality plant simulations, parametrizations, and both graphical features as well as a robust pipeline for efficient network training. Currently, I am working on a distributed learning setup and the parameter spaces for my virtual scenes.
We need scene parametrizations in a multitude of ways to provide realism while enabling augmentation. We need to know precisely what to augment and what to ignore. There is a need for learning architectures that support large-image processing, as well as for optimizations of data usage and computational complexity in an extremely heterogeneous pipeline.
The increasing need for real-time analytics motivated the emergence of new incremental methods to learn representations from continuous flows of data, especially in the context of the Internet of Things. This trend led to the evolution of centralized computing infrastructures towards interconnected processing units spanning from edge devices to cloud data centers. This new paradigm is referred to as the Computing or Edge-to-Cloud Continuum. However, the network and compute heterogeneity across and within clusters may negatively impact Deep Learning (DL) training. I would like to introduce a roadmap for understanding the end-to-end performance of DL workloads in such heterogeneous settings. The goal is to identify key parameters leading to stragglers and devise novel intra- and inter-cluster strategies to address them.
I would like to identify potential collaborations to work in a heterogeneous, computing continuum type context. I am focusing on deep learning training workloads.
PyDDA (Pythonic Direct Data Assimilation) is a 3DVAR framework that assimilates data from an arbitrary number of weather radars together with other spatial wind fields in order to retrieve high-resolution three-dimensional wind fields. PyDDA minimizes a cost function which is a weighted sum of individual constraints using SciPy's implementation of L-BFGS-B. This implementation takes advantage of as many cores as are present in a single machine, allowing for fast convergence. At Argonne we have ~27 TB of radar data in Darwin, Australia ready for testing PyDDA at scale. Because our problem exposes large-scale data parallelism, it is well suited to being mapped to tensors in hardware.
We recently reimplemented PyDDA using JAX and have attempted to run it on a Groq TSP, a Graphcore IPU, and an NVIDIA GPU. In doing so, we also attempted to reimplement PyDDA using PyTorch and TensorFlow, and were successful in reimplementing and validating PyDDA using TensorFlow. We will explain the challenges in this effort and the further work that remains.
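A minimal sketch of the JAX route, with made-up constraint terms and grid sizes rather than the actual PyDDA cost function: the weighted cost can be differentiated with jax.value_and_grad and handed to SciPy's L-BFGS-B.

```python
import jax
import jax.numpy as jnp
import numpy as np
from scipy.optimize import minimize

NZ, NY, NX = 1, 32, 32            # toy grid (assumed)
u_obs = jnp.zeros((NZ, NY, NX))   # stand-ins for radar-derived wind components
v_obs = jnp.zeros((NZ, NY, NX))

def total_cost(winds_flat, w_obs=1.0, w_smooth=0.1):
    u, v = jnp.reshape(winds_flat, (2, NZ, NY, NX))
    j_obs = jnp.sum((u - u_obs) ** 2 + (v - v_obs) ** 2)                       # observation term
    j_smooth = jnp.sum(jnp.diff(u, axis=-1) ** 2) + jnp.sum(jnp.diff(v, axis=-1) ** 2)
    return w_obs * j_obs + w_smooth * j_smooth                                 # weighted sum of constraints

value_and_grad = jax.jit(jax.value_and_grad(total_cost))

def fun(x):
    # L-BFGS-B expects a float cost and a float64 gradient array.
    val, grad = value_and_grad(jnp.asarray(x))
    return float(val), np.asarray(grad, dtype=np.float64)

x0 = np.zeros(2 * NZ * NY * NX)
result = minimize(fun, x0, jac=True, method="L-BFGS-B")
```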
We are interested in porting all aspects of the code to the accelerators instead of the cost functions alone. We are also looking to port the code to other accelerators.
Funded by a recent ASCR project, we are developing a privacy-preserving federated learning framework that implements distributed learning algorithms for AI/ML models while preserving data privacy. We introduce the current status of the framework and discuss challenges and potential research directions.
The questions and collaboration opportunities include the integration of advanced data compression/encryption in federated learning, applications to edge devices, leveraging and integrating existing frameworks that allow for model training on HPC, and federation across heterogeneous architectures.
In recent years Physics Informed Neural Networks (PINNs), neural networks that are aware of the underlying physics of the problem they try to solve, have become a popular tool for the solution of Partial Differential Equations (PDEs). In this work we exploit ideas stemming from classical solution methods for PDEs to accelerate the convergence of the training phase of PINNs. In particular, drawing inspiration from multigrid methods, we consider two different training strategies, both of which exploit the structure of the training problem to build a sequence of problems of decreasing dimension that approximate the original one. The knowledge of such a hierarchy is exploited to reduce the major cost per iteration of classical optimization methods, namely the step computation. The proposed method thus allows for a faster training process that is less dependent on the choice of the learning rate and more effective for problems with different modes.
The approach that we present has so far only been tested numerically; a meaningful research perspective is the development of its theoretical analysis. Another important direction that we would like to pursue is to test the method on real-world problems.
Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which are methods for solving linear systems via the factorization of matrices into products of triangular matrices, are commonly used in many contexts. The Cholesky factorization is the fastest direct method for symmetric positive definite matrices. In this talk, we present selective nesting, a method to determine the optimal task granularity for the parallel Cholesky factorization based on the structure of sparse matrices. We propose the OPT-D-COST algorithm, which automatically and dynamically applies selective nesting. OPT-D-COST is the first approach to leverage matrix sparsity to drive complex task-based parallel workloads in the context of direct solvers. We run an extensive evaluation campaign considering a heterogeneous set of 60 sparse matrices and a parallel machine featuring the A64FX processor. OPT-D-COST delivers an average performance speedup of 1.46x with respect to the best state-of-the-art parallel method to run direct solvers.
We plan to collaborate with other researchers to: 1) find new examples of applications where this technique could be applied; 2) find new ways to determine the optimal granularity for Cholesky based on the structure of the non-zeros, as several matrices still resist the proposed algorithm, and we would like to investigate why they differ from the others; 3) discuss strategies to further reduce the idle time of the different threads (hybrid algorithms?).
Dense systems of linear equations are usually solved using Gaussian elimination with partial pivoting. However, the growing gap between computational power and communication bandwidth makes pivoting increasingly costly, especially on distributed, heterogeneous systems. We present our work on using randomized transforms to solve the systems accurately without pivoting.
The approach has been analyzed on artificial test matrices; however, these matrices may or may not be representative of linear systems that arise in practice. So, it would be valuable to collaborate with real applications in order to understand how well the approach works on real linear systems.
An important way to improve the scalability of orthogonalizing a set of basis vectors is to reduce the communication required. We present a version of Classical Gram-Schmidt with reorthogonalization (CGS2) which reduces the synchronizations per iteration from three to one. The focus of this work is in the context of a QR factorization and Krylov expansions, where the expansion is one column at a time and "left-looking".
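For reference, a plain NumPy sketch of the baseline CGS2 step is shown below. This is an illustration only: the contribution described above is the reorganization that brings the global synchronizations per column down to one, which this naive version does not attempt.

```python
import numpy as np

def cgs2_append(Q, a):
    """Orthogonalize vector a against the orthonormal columns of Q, twice (CGS2)."""
    for _ in range(2):
        # In a distributed setting, Q.T @ a is the all-reduce (synchronization) point.
        a = a - Q @ (Q.T @ a)
    return np.column_stack([Q, a / np.linalg.norm(a)])

# Build a QR factorization one column at a time ("left-looking" expansion).
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
Q = A[:, :1] / np.linalg.norm(A[:, 0])
for j in range(1, A.shape[1]):
    Q = cgs2_append(Q, A[:, j].copy())
print(np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1])))   # close to machine precision
```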
An extension to the presented work is to consider having access to more than one column at a time and orthogonalizing a set of basis vectors by block. Additionally, delaying certain operations to achieve fewer synchronizations reduces the number of passes over the data, which can lead to a better memory footprint. Can this be extended beyond an orthogonalization scheme to additional algorithms of interest?
In symmetric tridiagonal NxN matrices with very large N, we find K eigenvalues and eigenvectors in a thin "slice" of the eigenspectrum, so K << N. These must be orthogonalized before being returned to the user, and this O(N*K^2) orthogonalization (currently done with TSQR, tall-skinny QR) takes up to 95% of the runtime even in parallel code with 72 cores.
The ST eigenvalues should be distinct and the vectors returned should be orthogonal. This is the case when eigenvalues are "well separated". The non-orthogonality occurs in clusters where eigenvalues are very close together, either absolutely, or relative to their magnitude. (This is not well characterized, just empirically noted.)
The current TSQR orthogonalizes every vector against every other, but there may be an opportunity for large performance improvements here via SELECTIVE orthogonalization, if we can use the vector of eigenvalues to characterize smaller clusters of eigenvectors that need to be orthogonalized.
It is an open question whether this kind of partitioning is possible. Empirically, in our test matrices, we've seen some eigenvector sets that need no orthogonalization, and others with just one cluster of about K/10 vectors that needed it; but O(N*(K/10)^2) would be a 100-fold speedup. The Wilkinson matrix seems to need only pairwise orthogonalization, an O(N*K) operation instead of O(N*K^2). Another collaboration might be providing real-world test matrices where spectrum slicing applies; we know some applications are in quantum physics.
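A rough sketch of what such a selective scheme could look like (the gap criterion and threshold below are assumptions for illustration, not a validated algorithm):

```python
import numpy as np

def cluster_by_gaps(eigvals_sorted, rel_gap=1e-6):
    """Split sorted eigenvalues into clusters separated by large relative gaps."""
    clusters, start = [], 0
    for i in range(1, len(eigvals_sorted)):
        gap = eigvals_sorted[i] - eigvals_sorted[i - 1]
        scale = max(abs(eigvals_sorted[i]), abs(eigvals_sorted[i - 1]), 1.0e-300)
        if gap > rel_gap * scale:              # well separated: close the current cluster
            clusters.append(list(range(start, i)))
            start = i
    clusters.append(list(range(start, len(eigvals_sorted))))
    return clusters

def selective_orthogonalize(eigvals_sorted, V):
    """Reorthogonalize eigenvectors only within clusters of nearby eigenvalues."""
    for cl in cluster_by_gaps(eigvals_sorted):
        if len(cl) > 1:                        # singletons need no orthogonalization
            Q, _ = np.linalg.qr(V[:, cl])      # O(N*|cluster|^2) instead of O(N*K^2)
            V[:, cl] = Q
    return V
```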
Geostatistics predicts desired quantities from geographically distributed data, based on statistical models and optimization of parameters. A primary computational kernel is the evaluation of the maximum log-likelihood estimation (MLE) function, whose central data structure is a dense, symmetric, positive definite covariance matrix, with two essential operations: application of the inverse and evaluation of the determinant. Both operations can be rendered through the Cholesky decomposition and triangular solution. The key observation is that environmental characteristics typically exhibit loss of correlation with distance, which motivates our study of mixed-precision factorization.
We incorporate the PaRSEC dynamic runtime system into the ExaGeoStat framework. By utilizing algorithmic, architectural, and programming model features of PaRSEC, ExaGeoStat-PaRSEC is able to employ numerical approximations by means of mixed-precision computations to reduce the arithmetic complexity and memory footprint. We validated the numerical precision of our proposed method with synthetic datasets and a real-world test case, and highlighted significant performance improvements on large-scale multi-GPU systems (Summit). To our knowledge, this is the first time a three-precision Cholesky solver has been deployed on large-scale GPU-based systems. This may permit us to achieve exascale geospatial applications, while providing new statistical insights at unprecedented scales.
This is joint work with Jan Hueckelheim, ANL. Our goal is to apply Algorithmic Differentiation, in its adjoint (aka reverse) mode, to OpenMP source code. We describe transformation rules, which we justify through the PGAS (Partitioned Global Address Space) formalism, then focus on the main issue of the race conditions introduced by adjoint AD. We investigate several approaches to address this issue and measure their performance on classical benchmarks.
- Issue: efficient implementation of atomic operations, especially of the "+=" kind.
- Issue: reusing data-dependency analysis of an original code for its adjoint differentiated code.
Today's scientific high performance computing (HPC) applications or advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce the data volumes but it can also strictly control the data distortion based on the user-specified error bound. Existing lossy compressors, however, cannot offer ultra-fast compression speed, which is highly demanded by quite a few applications or use-cases (such as in-memory compression and online instrument data compression). In this talk, we propose a novel ultra-fast error-bounded lossy compressor, which can obtain fairly high compression performance on both CPU and GPU, also with reasonably high compression ratios. The key contributions are three-fold: (1) We propose a novel, generic ultra-fast error-bounded lossy compression framework -- called SZx -- by confining our design to be composed of only super-lightweight operations such as bitwise and addition/subtraction operations, while still keeping a certain high compression ratio. (2) We implement SZx on both CPU and GPU and carefully optimize the performance according to their architectures. (3) We perform a comprehensive evaluation with 10 real-world production-level scientific datasets on both CPU and GPU. Experiments show that SZx is 2~16X as fast as the second-fastest existing error-bounded lossy compressor (either SZ or ZFP) on CPU and GPU, with respect to both compression and decompression.
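To make the "error-bounded" notion concrete, here is a toy absolute-error-bounded quantizer built only from lightweight operations. This is an illustration of the principle, not the SZx algorithm; real compressors add prediction and lossless encoding stages on top of such quantized codes.

```python
import numpy as np

def compress(data, error_bound):
    # Replace each value by the nearest multiple of 2*error_bound, stored as an integer;
    # the reconstruction error is then guaranteed to be at most error_bound.
    return np.round(data / (2.0 * error_bound)).astype(np.int32)

def decompress(codes, error_bound):
    return codes.astype(np.float64) * (2.0 * error_bound)

x = np.random.rand(1_000_000)
codes = compress(x, 1e-3)
assert np.max(np.abs(x - decompress(codes, 1e-3))) <= 1e-3
```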
Why is lossy compression speed very important? How to leverage ultra-fast lossy compression in practice?
On large clusters, communication is increasingly becoming the bottleneck. With the upcoming exascale machines, the gap between computation and network bandwidth is even larger than before.
In this talk, we consider different compression techniques, such as casting operations, and show how their integration into the communication phase can benefit an application. Moreover, in the case of mixed precision, where a loss of accuracy is acceptable, this approach improves the execution time and can also reach better accuracy. We illustrate this in the context of a 3D FFT code, where our approach gives one order of magnitude better accuracy than a full execution in lower precision.
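A minimal mpi4py sketch of the casting idea is given below (an assumed two-rank setup; in the actual work the cast is integrated into the FFT's communication phase rather than done at the application level):

```python
# Run with: mpiexec -n 2 python cast_send.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 1 << 20

if rank == 0:
    field = np.random.rand(n)                                         # float64 data
    comm.Send([field.astype(np.float32), MPI.FLOAT], dest=1, tag=0)   # half the bytes on the wire
elif rank == 1:
    buf = np.empty(n, dtype=np.float32)
    comm.Recv([buf, MPI.FLOAT], source=0, tag=0)
    field = buf.astype(np.float64)                                    # promote back before computing
```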
We want to understand how this approach can be integrated into existing codes.
Lossy compression plays a growing role in scientific simulations where the cost of storing their output data can span terabytes. Using error-bounded lossy compression reduces the amount of storage for each simulation; however, there is no known bound for the upper limit on lossy compressibility. Correlation structures in the data, the choice of compressor, and the error bound are factors allowing larger compression ratios and improved quality metrics. Analyzing these three factors provides one direction towards quantifying the limits of lossy compressibility. As a first step, we explore statistical methods to characterize the correlation structures present in the data and their relationships, through functional regression models, to compression ratios. In particular, compression ratios of the widely used lossy compressors for scientific data SZ, ZFP and MGARD exhibit a logarithmic dependence on the global and local correlation ranges when combined with information on the variability of the considered fields through the variance or gradient magnitude. Further work will focus on providing a unified characterization of these relationships across compressors and error bounds. This constitutes a first step towards evaluating the theoretical limits of lossy compressibility, to eventually predict compression performance and adapt compressors to the correlation structures present in the data.
Existence and characterization of theoretical limits of lossy compressibility
Recent advances in Deep Neural Networks (DNNs) have demonstrated a promising potential in predicting the temporal and spatial proximity of time evolutionary data. In this short talk, we introduce an effective (de)compression framework called TEZIP that can support dynamic lossy and lossless compression of time evolutionary image frames with high compression ratio and speed.
Can we extend TEZip with more domain-specific predictors?
This presentation will summarize recent improvements in capabilities for generic lossy compression using LibPressio powered by compressors such as SZ, ZFP, MGARD, and others including improvements to blackbox compression configuration techniques and parallel compression.
Open Questions/Collaboration Opportunities:
1. What are the remaining gaps in the capabilities of compression interfaces for applications (i.e., better support for asynchrony, accelerators, streaming compression, point clouds, and/or sparse compression, etc.)? Ideal collaborators have applications that are considering/using compression, but would like to consider ways to improve performance or evaluate different compression methodologies.
2. How can we further drive down the costs of black-box compression techniques to enable their use by a wider community (i.e., faster search techniques that use fewer invocations of the compressors, proxy-model based approaches, techniques that use fewer tunings, etc.)? Ideal collaborators have experience in numerical optimization/black-box optimization techniques, transfer-learning-inspired optimization approaches, or a related subject, and have interest in applying this to an interdisciplinary problem of bounding user metrics.
This talk looks at proposals meant to improve both the usability and efficiency of the MPI RMA interface. Some of the proposals include improved dynamic window memory handling, atomic operation efficiency, and multi-threaded passive-target synchronization.
We are specifically looking for feedback from developers who are using or have previously tried to use MPI RMA, and for input on what applications or higher-level runtime systems expect and require from the interface, in an attempt to overhaul the API for the 5.0 release of the MPI standard.
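For context, a minimal mpi4py sketch of today's passive-target RMA usage that such feedback would inform (assumed two-rank setup; the proposals above would change details of window creation and synchronization):

```python
# Run with: mpiexec -n 2 python rma_sketch.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 1024

win = MPI.Win.Allocate(8 * n, comm=comm)             # MPI-allocated window memory
local = np.frombuffer(win.tomemory(), dtype='d')     # view the local window as float64
local[:] = 0.0
comm.Barrier()

if rank == 0:
    data = np.arange(n, dtype='d')
    win.Lock(1, MPI.LOCK_SHARED)                     # passive-target access epoch on rank 1
    win.Put([data, MPI.DOUBLE], 1)
    win.Unlock(1)

comm.Barrier()
if rank == 1:
    print(local[:4])                                 # 0. 1. 2. 3.
win.Free()
```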
In this presentation, I will introduce TTG, a novel programming model and its C++ implementation that, by marrying the ideas of control and data flowgraph programming, supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared-memory parallel environments; a few support distributed-memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. I will demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, but without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently.
TTG aims at relieving the difficulty of programming data-dependent applications or applications with irregular and imbalanced memory access. New applications could challenge the programming model and allow us to extend it to support a larger spectrum of cases.
In this talk we will show the use of CUDA graphs to accelerate OmpSs benchmarks. With OmpSs, we can have a set of CUDA tasks recorded once within the CUDA runtime system and replayed many times, thus reducing the amount of overhead introduced by GPU management. In order to achieve this, we use the high-level CUDA graphs API.
We are open for collaborations in this field, for example with new ideas on how to use the CUDA graphs API, the development of additional benchmarks or applications with OmpSs, etc.
Finding the set of parameters that gives the best performance for a given application is very challenging. The exploration of the space spanned by the parameters to find the best set is combinatorial and often requires an unrealistic budget. To address this, some techniques, called auto-tuning, rely on Bayesian learning to create a model of performance.
We present in this talk an auto-tuning code based on deep Gaussian Process.
We show how the multi-task and multi-fidelity approaches can be combined with the transfer learning technique to improve the method. Our experiments on the SLATE library demonstrate that, with a limited budget, i.e., a fixed number of runs, we are able to obtain more accurate information than state-of-the-art codes.
We believe this auto-tuning code is a great opportunity to improve the performance of existing applications.
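As a point of comparison, a minimal Bayesian-optimization-style loop with a single (non-deep) Gaussian process from scikit-learn is sketched below; the objective function is a hypothetical stand-in for a real benchmark run, and the multi-task, multi-fidelity and transfer-learning aspects of the actual method are not captured here:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def run_benchmark(block_size):
    # Hypothetical objective: measured runtime as a function of a tuning parameter.
    return (block_size - 192.0) ** 2 / 1e4 + 0.01 * np.random.rand()

candidates = np.arange(32, 513, 16, dtype=float).reshape(-1, 1)
idx = np.random.choice(len(candidates), size=3, replace=False)
X, y = candidates[idx], np.array([run_benchmark(b) for b in candidates[idx, 0]])

for _ in range(10):                                   # fixed evaluation budget
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    nxt = candidates[np.argmin(mu - sigma)]           # lower-confidence-bound choice
    X = np.vstack([X, [nxt]])
    y = np.append(y, run_benchmark(nxt[0]))

print("best block size found:", X[np.argmin(y), 0])
```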
One of the most important categories of application performance is data traffic between the processing cores and the main memory. However, since the related performance event counters are not core-private, applications require elevated privileges to access them. PAPI offers a component that can access this information on IBM systems through the Performance Co-Pilot (PCP), but doing so adds an indirection layer that involves querying the PCP daemon. Using such benchmarks as DGEMM and STREAM, we study the accuracy of the measurements obtained through the PCP component on the Summit supercomputer.
Open Questions: Other than memory traffic events, what categories of "nest" (IBM equivalent of "uncore") events would be insightful to measure? Which scientific applications could benefit from measuring these events?
This presentation will introduce the DFG/RFBR project ExtraNoise (see https://www.vi-hps.org/projects/extranoise/overview/overview.html). Its goal is to develop performance measurement, analysis and modelling methods that are resilient to system noise. A second goal is to better understand the noise sensitivity of current HPC systems, applications and algorithms.
We would like to connect with other projects and groups which are working on system noise related issues.
The C Configuration Space and Tuning Library (CCS) is a framework created to bridge the gap between auto-tuning frameworks and applications with auto-tuning needs. CCS offers a C API to describe rich configuration spaces and objectives, as well as the auto-tuners to optimize those problems. CCS was conceived with performance in mind, in order to be integrated in and driven by low-level runtimes and applications (JIT, HPC runtimes, etc.). CCS also comes with a set of Python and Ruby bindings that allow prototyping auto-tuners in high-level languages and invoking them from the low-level applications.
This presentation will introduce the features of CCS and describe the co-design work that happened with the Kokkos Tuning API.
Collaboration opportunities:
- teams working with runtime systems and with hyperparameters to optimize could be interested in interfacing with CCS
- teams working on auto-tuning strategies could be interested in steering CCS in a direction that allows describing their problems and auto-tuning strategies
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding and optimizing end-to-end performance of applications in such a complex continuum is challenging. This breaks down to reconciling many, typically contradicting application requirements and constraints with low-level infrastructure design choices. One important challenge is to accurately reproduce relevant behaviors of a given application workflow and representative settings of the physical infrastructure underlying this complex continuum.
In this talk, I will introduce a rigorous methodology for such a process and its implementation in E2Clab, a framework supporting the reproducible and complete experimental cycle across the Edge-to-Cloud Continuum: deployment, monitoring, analysis, optimization. Next, I will illustrate our methodology by optimizing Pl@ntNet, a world-wide plant identification application.
Finally, I will briefly highlight ongoing work on adding support in E2Clab for the small wireless sensor edge devices provided by the very large scale FIT IoT-LAB testbed. Furthermore, I will present the research challenges of enabling provenance capture in Edge-to-Cloud experiments (How do existing provenance capture systems perform in resource-constrained environments? How can provenance be captured efficiently in this context?). The ultimate goal is that the captured data could support computational reproducibility and allow users to better understand how the experiment results have been produced.
How do the existing provenance capture systems perform in resource-constrained environments?
How to efficiently capture provenance in Edge-to-Cloud experiments? How have the experiment results been produced (e.g., hardware, software, and application configurations used)?
Potential collaborators: researchers aiming to understand and optimize the performance of their real-life use cases on large scale infrastructures in a systematic and reproducible approach. Do you have any use cases to share?
In this work, we study the problem of online adapting a geostatistics application, ExaGeoStat, to the best set of available heterogeneous nodes per application phase. ExaGeoStat is an iterative application that computes the MLE for spatial data. We consider the usage of the Gaussian Process (GP) to model the expected performance of a group of system-level heterogeneous nodes per application iteration. We show that by correctly modeling the GP trend, limiting the search space with a linear program prediction, and considering the heterogeneity with dummy variables, we can quickly adapt at runtime and reduce the total amount of computational nodes used.
The model is tailored for this application. We are interested in other task-based applications to evaluate the generality of our approach.
This presentation will cover EQ-SQL, a database-oriented backend for the EMEWS framework. This project is a derivative and generalization of components developed for COVID-19 modeling workflows built in 2020.
This topic touches on workflow interoperability, which may be of interest to others in JLESC.
In this short talk we will present an on-going initiative at INRIA to build a unified portfolio of HPC codes developed at INRIA, adopting good practice for improving productivity, sustainability and reproducibility, partly building on the E4S initiative.
We expect feedback from other partners involved in similar experiences, in particular about the E4S initiative.
Joint work by Sophie Cerf (1), Raphaël Bleuse (1), Valentin Reis (2), Swann Perarnau (2) and Eric Rutten (1) (1: Inria, 2: ANL), presented at Euro-Par 2021.
Production high-performance computing systems continue to grow in complexity and size. As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) for the whole platform becomes a challenge. Alongside the growing complexity of scientific workloads, this extreme heterogeneity is also an opportunity: as applications dynamically undergo variations in workload, due to phases or data/compute movement between devices, one can dynamically adjust power across compute elements to save energy without impacting performance. With an aim toward an autonomous and dynamic power management strategy for current and future HPC architectures, this paper explores the use of control theory for the design of a dynamic power regulation method. Structured as a feedback loop, our approach, which is novel in computing resource management, consists of periodically monitoring application progress and choosing at runtime a suitable power cap for processors. Thanks to a preliminary offline identification process, we derive a model of the dynamics of the system and a proportional-integral (PI) controller. We evaluate our approach on top of an existing resource management framework, the Argo Node Resource Manager, deployed on several clusters of Grid'5000, using a standard memory-bound HPC benchmark.
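A hedged sketch of such a feedback loop is shown below (made-up gains, limits and sensor/actuator stubs; it mirrors the structure of the approach, not the actual Argo NRM implementation):

```python
import time

KP, KI = 0.5, 0.1                    # gains assumed to come from the offline identification step
PCAP_MIN, PCAP_MAX = 60.0, 150.0     # assumed processor power-cap limits in watts

def read_progress():
    # Hypothetical sensor: application progress (e.g., heartbeats per second).
    return 95.0

def apply_power_cap(watts):
    # Hypothetical actuator, e.g., writing a RAPL power limit.
    print(f"power cap set to {watts:.1f} W")

def control_loop(target_progress, period_s=1.0, steps=30):
    integral = 0.0
    for _ in range(steps):
        error = target_progress - read_progress()          # positive if progress lags the setpoint
        integral += error * period_s
        pcap = min(PCAP_MAX, max(PCAP_MIN, KP * error + KI * integral))
        apply_power_cap(pcap)
        time.sleep(period_s)

control_loop(target_progress=100.0)
```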
dynamic resource & jobs runtime management
Autonomic Computing, Feedback Loops, and Control Theory
We present a job management software, OACIS, with which you can manage jobs, conduct parameter scans, and inspect the results visually via an interactive interface. In my talk, I will present a demo of the software.
collaboration with other workflow systems
In recent years, power consumption in HPC data centers has reached the tens-of-megawatts scale due to the large size of computing facilities. Most of the electricity consumed by IT equipment is converted into heat energy. This heat is ultimately removed by a cooling system consisting of mechanical devices with various technical constraints. Prior to the launch of the official service of Fugaku, we conducted benchmark campaigns to evaluate its overall performance. In this study, we analyze the details of the cooling facility operations using operational metrics collected during the evaluation period.
We are working to improve operations through an operational data analysis approach and are looking for opportunities to accelerate the project through collaborations. We welcome any and all ideas and attempts.
High-performance computing systems are approaching exascale operations per second (10^18 FLOPS), significantly increasing in size and complexity. These developments also expand the variety and frequency of possible problems, demanding continuous monitoring and analysis of the system and its usage during regular production. With this intent, the Jülich Supercomputing Centre has been developing the LLview monitoring infrastructure, combining monitoring metrics from CPU, GPU, I/O and interconnect activities. With negligible overhead, LLview provides individualised fast access to job reports for users, mentors, support and administrators alike, which are then used to manage and support production runs. This presentation provides an overview of LLview, a few success cases and perspectives for future enhancements and collaborations.
Exchange experience analysing and presenting report data, and how to use monitoring in daily HPC support activities. Discuss among centers about similar activities; get ideas on how to improve current available sources, metrics and presentation; discuss statistical analysis of collected performance data.
I will describe our approaches to running programs while managing power consumption.
We will share our experience in operating large-scale facilities and aim at power-saving operation by utilizing hardware and facility functions.
For a single task, the Young/Daly formula provides the optimal checkpointing period to minimize the expectation of the total execution time. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. This work studies the impact of checkpointing more than prescribed by the Young/Daly formula both on the theoretical and practical side.
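For reference, the Young/Daly period for checkpoint cost C and mean time between failures MTBF is W_opt = sqrt(2 * C * MTBF). The sketch below (with assumed numbers) shows how quickly some failure occurs across a set of concurrently running tasks as the task count grows, which is the motivation for considering periods shorter than the single-task optimum:

```python
from math import sqrt

def young_daly_period(mtbf_s, checkpoint_cost_s):
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

node_mtbf = 5 * 365 * 24 * 3600.0      # assumed: one failure per node every 5 years
checkpoint_cost = 60.0                 # assumed: 60 s per checkpoint

for n_tasks in (1, 1_000, 100_000):
    # With n independent tasks, some task in the set fails roughly n times more often.
    aggregate_mtbf = node_mtbf / n_tasks
    period = young_daly_period(node_mtbf, checkpoint_cost)
    print(f"{n_tasks:>7} tasks: per-task Young/Daly period {period / 3600:6.1f} h, "
          f"time to first failure across the set ~{aggregate_mtbf / 3600:10.1f} h")
```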
- Is there a realistic workflow benchmark for parallel (rigid or moldable) tasks available for simulation? How can we find realistic instances combined with their failure laws? Is there any JLESC partner interested in collaborating on this topic?
- In this work we focused on finding a good checkpointing strategy for a fixed given schedule, but is it possible to design a global algorithm that schedules, maps and checkpoints tasks all at once?
We study the resilient scheduling of moldable parallel jobs on high-performance computing (HPC) platforms. Moldable jobs allow for choosing a processor allocation before execution, and their execution time obeys various speedup models. The objective is to minimize the overall completion time or the makespan, when jobs can fail due to silent errors and hence may need to be re-executed after each failure until successful completion. Our work generalizes the classical scheduling framework for failure-free jobs. To cope with silent errors, we introduce two resilient scheduling algorithms and show the approximation ratios for the two algorithms under several prominent speedup models (e.g., roofline, communication, Amdahl, power, monotonic, and a mix model). An extensive set of simulations is conducted to evaluate different variants of the two algorithms, and the results show that they consistently outperform some baseline heuristics.
While this work considers independent parallel jobs, one open question is to study jobs with dependencies (i.e., workflows) and present efficient algorithms to cope with job failures in workflows. Another open question is to consider fail-stop errors and design effective strategies that combine scheduling and checkpointing. Collaboration opportunities exist amongst researchers working on resilience, system-level failure recovery, and workflow scheduling.
Portability and resilience are growing problems in high performance computing with different requirements that do not always get along. More systems are relying on accelerators for computational power while memory hierarchies expand to feed said accelerators. As supercomputers increase in scale and heterogeneity, efficiently leveraging the hardware becomes more difficult and the mean time between failures decreases. Portability abstractions such as Kokkos address the heterogeneity problem while checkpoint/restart systems such as VeloC provide resilience against system failures. We present a proof of concept that augments Kokkos applications using VeloC by enabling access pattern aware checkpointing (i.e., by leveraging application and system awareness to create efficient portable checkpoints).
The portability and checkpoint challenges present an interesting opportunity to leverage information from portability abstractions to augment checkpoint runtimes. Kokkos data abstractions contain information on how data is accessed. For example, whether concurrent accesses with many threads are done through atomics or data replication. These abstractions also simplify detecting more complex access patterns, such as sparse updates. The Kokkos Resilience project combines Kokkos abstractions with state-of-the-art checkpoint runtimes to provide checkpoint capabilities to Kokkos applications while remaining portable and minimizing constraints imposed on the developer. Kokkos Resilience uses VeloC, a scalable asynchronous checkpoint runtime, as the back-end.
We present the groundwork for integrating access pattern aware strategies into Kokkos Resilience abstractions and VeloC. Exploiting this opportunity will allow for more intricate checkpoint optimizations without the complexity normally required to implement and use them. These optimizations would also be portable and accessible by any Kokkos application. Specifically, we demonstrate the effectiveness of access pattern aware checkpointing by leveraging incremental checkpointing for sparsely updated data. We show how this approach allows for smaller and faster checkpoints while reducing complexity to users.
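An illustrative sketch of the incremental, access-pattern-aware idea is given below (an assumption-level toy, not the Kokkos Resilience/VeloC implementation): only blocks whose content changed since the previous checkpoint are written, which pays off for sparsely updated data.

```python
import hashlib
import numpy as np

class IncrementalCheckpointer:
    def __init__(self, block_elems=4096):
        self.block_elems = block_elems
        self.hashes = {}

    def checkpoint(self, name, array, store):
        flat = array.reshape(-1)
        for start in range(0, flat.size, self.block_elems):
            block = flat[start:start + self.block_elems]
            digest = hashlib.sha1(block.tobytes()).hexdigest()
            key = (name, start)
            if self.hashes.get(key) != digest:   # block changed since the last checkpoint
                store[key] = block.copy()
                self.hashes[key] = digest

state = np.zeros(1 << 20)
ckpt = IncrementalCheckpointer()
full, incr = {}, {}
ckpt.checkpoint("state", state, full)    # first checkpoint writes every block
state[12345] = 1.0                       # sparse update touches a single block
ckpt.checkpoint("state", state, incr)    # second checkpoint writes only that block
print(len(full), len(incr))              # e.g., 256 vs. 1
```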
This project is collaborative research between ANL, SNL, University of North Texas, and GCLab
Incremental checkpoints are useful for applications with infrequent updates such as graph applications. We want to expand Kokkos Resilience and VeloC to support other access patterns seen in applications. What other access patterns can we use to augment checkpoint behavior?
Not all data needs to be saved in a checkpoint. Manually filtering out unnecessary data is more efficient but requires more effort from the developers. How can we use application access patterns to automatically determine whether or not data needs to be checkpointed?
VeloC provides a simple but powerful API that makes adding checkpointing capabilities to applications easy. Kokkos Resilience aims to further automate the process for Kokkos applications. What improvements can we make to the Kokkos Resilience interface to help users control checkpoint behavior in an application agnostic fashion?
We have shown that information about the frequency of updates to data can be gathered from Kokkos and used to enhance checkpoint performance. What other information or statistics can we gather to tune VeloC performance?
This work provides an optimal checkpointing strategy to protect iterative applications from fail-stop errors.
- Identify JLESC applications that are iterative and determine their task lengths and checkpoint costs
- Extend the study to fit the framework of some JLESC applications
- Maybe consider applications that are composed of repeated executions of the same DAG (not just a linear chain)?
Particle and ensemble Kalman filtering are standard techniques for ensemble data assimilation for weather and climate prediction. Data assimilation aims to combine observation and numerical simulation to achieve an accurate description of the system's state. Both techniques carry the state's probability distribution function within the ensemble of states. As a consequence, the ensemble needs to be sufficiently large. As climate models on their own often require a large number of resources, running thousands of them quickly grows to a very large scale. It is still common to leverage the file system to communicate the ensemble states between different software layers of the data assimilation framework. However, as the modeling space and resolution increase, this practice poses a bottleneck. On the other hand, in-memory implementations are hard to protect due to the very same issue of the I/O system posing a bottleneck. Our implementation uses the client-server structure of Melissa-DA and circulates particles in the background through the parallel filesystem, using node-local worker processes. A combination of dynamic scheduling, prefetching and a distributed cache layer on the compute nodes allows parallel efficiencies of over 90% even at very large scales.
The largest execution we performed has been on Fugaku (Riken) on 2048 nodes with 32K particles. At that scale, and with a state size of 512 MB, we potentially write 1 TB of data concurrently. We haven't reached a sufficient scale to really saturate the I/O throughput of the parallel file system. Initially, we used the network to request states from other runners instead of loading them from the parallel file system. However, at the operated scale, this approach is not faster. It is still an open question whether there is a configuration in which a peer-to-peer state distribution performs better.
We are working on storing the states using lossy methods that reduce the state size. For instance, lossy compression with ZFP or other interpolation methods. An open question is how the data assimilation responds to the additional model error imposed by those methods. The insight of climate researchers is very much appreciated for a discussion on this topic.
A collaboration that is always welcome is porting weather and climate simulations to the framework.
Of particular interest in the context of HPC-AI convergence is the need to capture, persist, search for and reuse training/model data in a variety of scenarios: resilience, provenance tracking, reproducibility, workflows, algorithms that revisit previous results, etc. Unfortunately, traditional data storage on HPC platforms has seen a relatively static evolution: it is still centered around parallel file systems and key-value stores that offer only limited low-level capabilities, leaving a gap at application level that is currently filled with inefficient ad-hoc techniques. This talk introduces several efforts to address this gap, from checkpoint-restart techniques used in the context of resilience that can be extended as light-weight versioned caching layers up to novel data models centered around data states, which introduce a way to represent and reason about intermediate data as a first-class citizen.
How can we (re)use intermediate transformations of training data and/or DNN models to accelerate AI workflows (pre-processing, network architecture search, sensitivity analytics, etc)?
Hardware specialization is gaining more attention as transistor scaling comes to a halt. At the same time, open-source hardware ecosystems and tools are growing and becoming more practical. For example, the Chisel hardware construction language, a domain-specific language implemented as Scala class libraries, is a popular open-source hardware language originally designed to describe RISC-V and employed by numerous hardware projects. We have been using Chisel for designing custom hardware components and performing functional verification. In this short talk, I briefly introduce Scala and Chisel alongside design examples and talk about how we use Chisel-based hardware designs.
There are a couple of open questions related to this work. First, why and where is hardware specialization needed? What needs to be specialized? How should it be implemented? Is it feasible?
Needs for computing in restricted or contained environments (power, size, stall-free, real-time), where no general-purpose computer meets the requirements, are opportunities for hardware specialization. An example is real-time computing at the edge of scientific instruments.
With an increasing workload diversity and hardware complexity in HPC, the boundaries of today's runtimes are pushed to their limits. This evolution needs to be matched by corresponding increases in the capabilities of system management solutions.
Power management is a key element in the upcoming exascale era: first, to allow us to stay within the power budget, but also to let applications make the most of the available power in order to make progress. Therefore, our objective is to balance complex application requirements while keeping power consumption under budget.
To achieve this goal, the Argo group is working on the Node Resource Manager (NRM) tool, which allows us to centralize node management activities such as resource and power management. The latter is achieved by getting information (monitoring) from various sensors (power, temperature, fan speed, frequency...) and adjusting actuators (CPU p-states, Intel RAPL) according to the application needs. The next step in our power management strategy is to improve NRM monitoring to more easily identify the location (within the topology) and scope (range of devices) that monitoring events are related to.
To evaluate our implementation, we are looking for JLESC members willing to extend this work with more complex applications with dynamic resource balancing problems, on which we first can observe such imbalance, and then address it with a better power management strategy relying on precise identification of the relation between the gathered monitoring events, the devices present, and the inner components of applications. We are aiming to get a better understanding of the behavior of such applications under various scenarios of power management, as well as studying the possibility of characterizing applications' power needs in order to develop an automated resource management policy.
waLBerla is a modern multiphysics simulation framework specifically designed to exploit the full power of the largest supercomputers for a wide class of scientific research questions. In the past, waLBerla has run with near-perfect scalability on both Piz Daint (2048 GPUs) and JUQUEEN (~458k cores).
Now, as part of the SCALABLE project funded by the EuroHPC JU, Jülich is working on porting waLBerla to ARM processors. This talk will focus on the main problems that will need to be solved in order to port waLBerla to ARM processors. As part of the discussion and future work, we would like to investigate the possibilities of making such a port scale to large ARM machines.
Open Questions:
- Are the fundamental data structures and algorithm designs in waLBerla compatible with writing scalable, performant software for ARM systems? How might they be changed to work better for ARM (SVE, for example)?
- What are the previous experiences of porting extremely scalable multiphysics simulation codes to ARM processors, and how might they apply to waLBerla?
- What are the available options for a supercomputing center to offer ARM expertise and resources in performing the port?
Collaboration Opportunities:
- An opportunity to take a code that is scalable both on GPUs and x64 CPUs, port it to ARM, and try to achieve similar scalability.
We present our experience with the programming language Julia for fast prototyping of novel GPU-focused algorithms that leverage the language's portability features. This allowed us to port complex constrained optimization algorithms, together with advanced methods such as automatic differentiation, to multiple architectures with very low development effort.
We want to release our GPU portability efforts in Julia to the wider applied mathematics community and are looking for collaborations in the numerical optimization software field.
We are regulating the injection of bag-of-tasks (BoT) jobs with low priority into a computing grid in order to use idle resources. However, this injection can deteriorate the overall performance of the clusters (e.g., the performance of the parallel filesystem). Thus, we need to regulate the injection of those BoT jobs in order to guarantee a quality of service for the priority users of the grid. To do so, we are using control theory tools. This is joint work with Eric Rutten (INRIA) and Olivier Richard (UGA).
- Resource harvesting
- Dynamic resource & jobs management
- Autonomic Computing and Control Theory
We are performing performance and scaling measurements with the Earth System Models ICON and TSMP on modular heterogeneous architectures (the JUWELS and DEEP systems at JSC). We will present first porting and scaling results for homogeneous vs. heterogeneous runs, as well as performance analysis results.
Since our work is at the beginning, many questions are open. For example what kind of performance gain (if any) can we expect using ICON or TSMP with the modular system as opposed to a standard homogeneous system? Does the modular approach place any limitations on the frequency of data exchange between the two sub-models on the individual partitions, due to the increased latency between the individual hardware partitions?
We present the Feature Tracking Kit (FTK), a framework that simplifies, scales, and delivers various feature-tracking algorithms for scientific data. This talk will demonstrate technical highlights, use cases, and scalability studies of FTK through scientific applications including tokamak, fluid dynamics, and superconductivity simulations.
Seeking collaboration opportunities in the JLESC community on data analysis and visualization with advanced feature tracking capabilities.
The I/O bottleneck in next-generation computing systems constitutes a major challenge for the analysis of data-intensive high-performance computing (HPC) simulations. In situ processing, combined with in-memory computing, has emerged as a solution to overcome I/O bottlenecks in HPC systems, yet how to exploit these technologies efficiently from the infrastructure and human resources perspectives is still a challenging open problem.
This work targets one of the most common applications on petascale: molecular dynamics (MD) simulations studying the classical time evolution of a molecular system at atomic resolution. We address the challenge of bringing in situ analytics to MD by designing and developing A4MD: a workflow engine that features plug-and-play workflow composition of reusable building blocks; transparent in situ annotation capability for user-defined methods; simulator-agnostic runtime trajectory analysis; support for stopping, starting and restarting simulations; and ensemble workflow patterns optimized for HPC environments. A4MD empowers scientists to integrate simulation and analytics into complex workflows for runtime detection of changes in structural and temporal molecular properties. This knowledge of molecular structures' transformations at runtime can be used to steer simulations to more promising areas of the simulation space, identify the data that should be written to congested parallel file systems, and index generated data for retrieval and post-simulation analysis.
We demonstrate the capabilities of A4MD by studying a case of enhanced adaptive sampling for the exploration of the conformational space of the FS peptide protein. We model the execution of an ensemble of trajectories and analyse the overall throughput obtained by the workflow. Using A4MD, we integrate runtime trajectory analysis to automatically detect inefficiencies in the behaviour of the simulations.
Collaboration opportunities in topics of interest to JLESC include:
- Orchestration and placement of workflow components to maximize resource utilization. A4MD is able to terminate and restart simulations based on scientific criteria (e.g., a simulation is not reaching the desired state). As shown in the presentation, this procedure already improves the overall efficiency of the ensemble. However, we are currently not monitoring the simulations in terms of their resource utilization, and we are not making placement decisions to mitigate system inefficiencies.
- Annotation, indexing and storage of trajectory metadata. Although A4MD has the mechanisms in place to automatically generate simulation metadata, we are currently not managing the aspects of persistent storage that arise from annotation, indexing and querying of these products.
- Applications. We look for use cases that can leverage the wide range of functionalities in A4MD, especially in the area of enhanced adaptive sampling through MD ensembles.
A protein's structure determines its function. Different proteins have different structures; proteins in the same family share similar substructures and thus may share similar functions. Additionally, one protein may exhibit several structural states, also named conformations. Identifying different proteins and their conformations can help solve problems such as determining the cause of diseases and designing drugs. X-ray Free Electron Laser (XFEL) beams are used to create diffraction patterns (images) that can reveal protein structure and function. The translation from diffraction patterns in the XFEL images to protein structures and functionalities is nontrivial.
We present the current status of our framework XPSI (XFEL-based Protein Structure Identifier). XPSI combines DL (autoencoder) and ML (kNN) to capture key information that allows the identification of properties such as spatial orientation, protein conformation, and now protein type from the diffraction patterns. In our previous talk, we explored complex protein diffraction imaging datasets: three imaging resolutions and two proteins with four conformations each. We added the prediction of the 3rd rotational angle for spatial orientation and enunciated the symmetry challenges for its prediction. For this edition, we propose a solution to overcome the symmetry challenges and obtain higher accuracy for the 3rd rotational angle prediction. Furthermore, we add the identification of multiple proteins, completing the properties required to identify the structure of proteins and do 3D reconstruction of the proteins. We quantify the classification accuracy and performance of XPSI for two protein types, EF2 and ribosome, with two conformations each and two imaging resolutions. We obtain an orientation prediction error of up to 10 degrees, a difference in the predicted third angle of up to 5 degrees, a conformation prediction accuracy of 92%, and a protein type prediction accuracy of 100%.
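A minimal sketch of the classification stage only, with random stand-ins for the autoencoder embeddings (in XPSI these features come from the trained DL model; the dimensions and labels below are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
latent_train = rng.normal(size=(2000, 64))            # assumed 64-D latent vectors
labels_train = rng.integers(0, 4, size=2000)          # e.g., protein type x conformation classes

knn = KNeighborsClassifier(n_neighbors=5).fit(latent_train, labels_train)
latent_new = rng.normal(size=(10, 64))                # embeddings of new diffraction patterns
print(knn.predict(latent_new))
```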
As next steps, we are working on 1) aspects of shareability and portability through a Jupyter notebook on platforms like Wasabi or Google Colab; 2) testing the framework with more data (i.e., three or more protein types with two or more conformations each); and 3) comparing our framework with a traditional XFEL slice-matching approach, which is used for 3D reconstruction and follows the premises of orientation prediction through multiple, time-consuming iterations.
This project is collaborative research between RIKEN, GCLab, and ICL.
What is the suitability of the current framework for real-world cases (e.g., 3-D reconstruction of protein in general and ribosome in particular, annotation of images for their classification without human intervention)?
What is the range of errors (i.e., distribution of orientation and/or conformation) tolerable for scientists?
What are the costs in terms of computational requirements (execution time and resources) for other methods that identify proteins' properties used by scientists, and how does our framework compare?
Can we use systematic data from simulations to train the framework and use the trained models for classification/prediction of real experiment data successfully? What aspects from the science, such as, rotational orientation, beam resolution, symmetrical proteins, can presumably limit the functionality of the framework?
Scientific workloads executed in High Performance Computing (HPC) environments are becoming increasingly complex while routinely producing terabytes to petabytes of intermediate or long-term data products. Optimizing Input/Output (I/O) to absorb these data near the optimal performance offered by the underlying storage systems is a major challenge for science users. In particular, many established performance tools currently do not allow a holistic perspective on workflow I/O because the fusion of workflow and performance artifacts is complicated. For example, task-data dependencies are often only known to the workflow engine. Performance counters are usually captured through system APIs or third-party instrumentation, but the observed performance has to be evaluated with respect to the execution context, which should also include topological details and capacity constraints of the allocation in which workflow tasks or applications are executed.
In this work we present the I/O Behavior Analysis Toolkit (IOBAT), that was specifically developed for the fusion and interactive exploration of workflow-, system- and performance artifacts. We use IOBAT as a prototype framework to inform the requirements for both programmatic and interactive analysis of workflow I/O behavior. IOBAT deliberately opts to develop reusable analysis and visualization components that can be used in Jupyter Notebooks as well as in custom application or web-based dashboards.
This project is a collaboration with ANL as well as with the University of Hamburg and the University of Magdeburg.
While visualizing even large amounts of artifacts is possible, the presentations can quickly degrade in usefulness without good strategies to filter, reduce, or cluster information. Many of these as well as aspects of level of detail can be automated and augmented using techniques from machine learning.
We are seeking partners to instrument workflows (both with and without workflow engines) to analyse and inform additional tooling and post-processing before visualization.
Collaboration on developing guidance for introspection APIs and standard telemetry to capture by middleware, workflow engines, and system services to help reduce gaps when stitching together workflow artifacts.
Over the past 6 years, JSC has been part of the Sage and Sage2 EU projects. These focused on first-hand experience with Cortx, a newly developed object store from Seagate. For that purpose we have administered the main prototype in Jülich, ported some use cases and developed a set of middleware. The talk will focus on the main issues faced when accommodating an object store, including challenges with moving use cases over from POSIX file systems and the middleware required to facilitate the porting efforts.
As part of the discussion and future work, we would like to investigate how object stores are moving on from research, towards becoming part of a Supercomputing centre's main offered storage infrastructure.
- How are applications preparing and porting to object stores?
- What are the experiences with deployments of various large installations of object stores (ex. DAOS)?
- What are the available options for a Supercomputing centre to offer both POSIX and object storage interchangeably?
- Collaboration opportunities on middleware developed to ease access and porting of applications to object stores:
Here we consider that some of the principles are transferable between the various available implementations of object stores, leading to similar problems being faced by all those who adopt them. A collaboration could identify similarities and attempt to create more generic tools. For example, an ingestion tool allowing for user-configured, formatted data transfer from a POSIX file system to an object store.
Data assimilation combines observational information with numerical models to provide an improved estimate of the state of interest. In the field of numerical weather prediction, data assimilation plays an essential role in the advancement of weather forecasting via the assimilation of observations from various platforms. A global data assimilation system that couples the Non-hydrostatic Icosahedral Atmospheric Model (NICAM) with the Local Ensemble Transform Kalman Filter (LETKF), namely the NICAM-LETKF, was first developed in 2009. Since then, NICAM-LETKF has been extended to assimilate realistic observations from both conventional and satellite platforms. To increase its computational efficiency on supercomputers, the code structure and data throughput of NICAM-LETKF were re-designed for the K computer and further optimized for the Fugaku supercomputer. The NICAM-LETKF consists of three major steps: 1) the observation operator, which transforms the model state to observation space, 2) the LETKF analysis, which takes the observation data and ensemble forecast data to perform data assimilation, and 3) the NICAM ensemble forecast, which integrates the updated ensemble state forward to the next analysis time. During the LETKF analysis step, matrix multiplication and eigenvalue decomposition are employed to solve the analysis equation in the ensemble space for each grid point in parallel. The output from the LETKF analysis includes the mean state and the individual ensemble members in the original icosahedral grids stored in separate tiles (small groups of grid points). In this presentation, we will provide an overview of the NICAM-LETKF system and use a new feature that assimilates data from a satellite-borne precipitation radar as an example to demonstrate the workflow and the computational efficiency of the system.
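For reference, a minimal NumPy sketch of the per-grid-point analysis step in ensemble space, following the standard LETKF formulation with a diagonal R; the array shapes, inflation handling and omitted localization are simplifying assumptions of this toy, not the NICAM-LETKF code:

```python
import numpy as np

def letkf_analysis(Xb, Yb, yo, r_diag, rho=1.0):
    """Xb: (n, k) background ensemble, Yb: (p, k) ensemble mapped to observation space,
    yo: (p,) observations, r_diag: (p,) observation error variances, rho: inflation."""
    k = Xb.shape[1]
    xb_mean = Xb.mean(axis=1, keepdims=True)
    Xp = Xb - xb_mean                                 # state perturbations
    yb_mean = Yb.mean(axis=1, keepdims=True)
    Yp = Yb - yb_mean                                 # observation-space perturbations

    C = Yp.T / r_diag                                 # Yp^T R^{-1} (R diagonal)
    A = (k - 1) / rho * np.eye(k) + C @ Yp
    lam, Q = np.linalg.eigh(A)                        # eigendecomposition in ensemble space
    Pa = Q @ np.diag(1.0 / lam) @ Q.T                 # analysis covariance in weight space
    Wa = Q @ np.diag(np.sqrt((k - 1) / lam)) @ Q.T    # perturbation weights
    wa_mean = Pa @ (C @ (yo - yb_mean[:, 0]))         # mean weight update

    return xb_mean + Xp @ (wa_mean[:, None] + Wa)     # (n, k) analysis ensemble
```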
How to further improve computational efficiency and I/O for large ensemble data assimilation?
Quantum chemistry software comprises immensely useful tools in materials and biological science research, but only a few programs have been developed in Japan. The mission of our research team is to provide high-performance software for molecular electronic structure calculations, and we have been developing the NTChem package for general-purpose and massively parallel computation on supercomputers including Fugaku. It implements not only standard quantum chemistry approaches, but also our original and improved theoretical methods. The package has been successfully ported to Fugaku, the fastest supercomputer in the world, and demonstrates the capability of density functional theory (DFT) calculations on systems with more than 50,000 atomic orbitals.
NTChem is expected to be a useful tool in various computational studies for large and complicated molecular systems.
Currently NTChem2013 v12.1 is available on Fugaku, and we are open to collaboration.
Quantum gates are simulated in a numerically exact manner using the supercomputer Fugaku. A novel bit-swap algorithm for distributed computation is proposed in order to make the fullest possible use of memory. Using this algorithm, it is possible to remove the restriction of the previous bit-swap algorithm that the number of MPI processes must be a power of two.
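As a rough illustration of the generic bit-swap idea (not the new algorithm presented in the talk), the sketch below exchanges a qubit encoded in the MPI rank bits with a qubit encoded in the local amplitude index of a distributed state vector, so that subsequent gates on that qubit become purely local. The qubit counts and initial state are placeholder assumptions.

```python
# Hedged sketch: swapping a "global" qubit (encoded in the MPI rank number) with
# a "local" qubit (encoded in the local amplitude index) of a distributed state
# vector. Requires at least 2**(global_bit+1) MPI ranks; illustration only.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
local_qubits = 20                        # each rank stores 2**local_qubits amplitudes
state = np.zeros(1 << local_qubits, dtype=np.complex128)
if rank == 0:
    state[0] = 1.0                       # |00...0>

def swap_global_with_local(state, global_bit, local_bit):
    """Exchange the half of the local buffer that changes rank under the swap."""
    partner = rank ^ (1 << global_bit)                  # rank differing in that bit
    local_bit_set = (np.arange(state.size) & (1 << local_bit)).astype(bool)
    mine_has_one = bool((rank >> global_bit) & 1)
    # Amplitudes whose local bit disagrees with our value of the global bit
    # belong on the partner rank after the swap (with the two bits exchanged).
    moving = local_bit_set != mine_has_one
    to_send = state[moving].copy()
    received = np.empty_like(to_send)
    comm.Sendrecv(to_send, dest=partner, recvbuf=received, source=partner)
    state[moving] = received
    return state

# Example (needs >= 2 ranks): state = swap_global_with_local(state, 0, 3)
```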
How can we further reduce memory usage (e.g., by data compression or by using storage)? How can we make the MPI communication in our algorithm more efficient?
This talk will focus on the in-house CFD solver CUBE (Complex Unified Building cubE) developed at R-CCS. CUBE is a unified solver framework capable of tackling various multiphysics applications such as fluid flow, solid mechanics, fluid-structure interaction, combustion, droplet dynamics, etc. A brief overview of the background and structure of the solver, as well as the various applications for which it is used, will be presented.
Dynamic load balancing, AMR and machine learning are areas in which we are currently interested in expanding the capabilities of CUBE.
Molecular dynamics (MD) simulations have been widely used to explore biological phenomena that are difficult to study with experiments. Since all-atom MD simulations of large biomolecular complexes are computationally expensive, coarse-grained (CG) models based on different approximations have been developed. Here, we introduce an integrated implementation of several popular CG models in the MD program GENESIS. We developed a user-friendly toolbox to generate input files of residue-level CG models containing folded and disordered proteins, RNAs, and DNAs using a unified format, and we optimized the performance of CG MD simulations via efficient parallelization in the GENESIS software. Our implementation will serve as a framework to develop novel CG models and investigate various biological phenomena in the cell.
We would like to explore the extensibility of our program, to include more CG models for different biomolecules and to integrate with force fields of different resolutions. In addition, it is also important to consider the compatibility of our program with existing modeling tools.
We introduce new parallel computing algorithms in the GENESIS software to perform cellular-scale molecular dynamics (MD) simulations on Fugaku. These include (1) a new algorithm for real-space non-bonded interactions, (2) reciprocal-space non-bonded interactions that minimize communication cost, (3) accurate temperature/pressure evaluations that allow a large time step, and (4) efficient parallel file input/output (I/O) for MD simulations of extremely large systems. Based on these developments, GENESIS can perform MD simulations of a system containing 1.6 billion atoms at a performance of 8.30 ns/day. This will extend the accessible size and time scales of MD simulations to answer unresolved questions about biomolecules in a living cell.
I am interested in efficient file I/O for very large systems.
As compute node complexity in high-performance computing keeps increasing, systems equipped with heterogeneous memory devices are becoming prevalent. Efficiently utilizing heterogeneous memory based systems, however, poses significant challenges to application developers.
System software level transparent solutions utilizing artificial intelligence and machine learning based approaches, in particular non-supervised learning based methods such as reinforcement learning, may come to the rescue.
However, such methods require rapid estimation of execution runtime as a function of the data layout across memory devices for exploring different data placement strategies in an iterative fashion, rendering architecture level simulators (e.g., gem5, Ramulator, etc.) impractical for this purpose.
We investigate a differential tracing based approach in which we obtain memory access traces using high-frequency sampling (e.g., Intel's PEBS) on real hardware running out of different memory devices. We develop a runtime estimator based on such traces that provides a runtime estimate orders of magnitude faster than full system simulators.
This talk introduces the approach we took, provides preliminary results on estimation accuracy and discusses future directions for its application to machine learning based heterogeneous memory management.
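A minimal sketch of the kind of trace-based estimate involved is shown below: given addresses sampled at high frequency from a run out of the fast tier, it charges an assumed extra latency to every sampled access that falls on a page placed in a slower device. The latency gap, sampling period and page size are illustrative assumptions, not the model used in this work.

```python
# Hedged sketch: back-of-the-envelope runtime estimate for a candidate data
# placement, derived from a sampled memory access trace (PEBS-style).
import numpy as np

def estimate_runtime(baseline_s, sampled_addrs, slow_pages,
                     sampling_period=1000, latency_gap_ns=250.0, page_shift=12):
    """Estimate runtime if the pages in `slow_pages` move to a slower tier.
    baseline_s: measured runtime with all data in the fast tier.
    sampled_addrs: virtual addresses from high-frequency sampling of that run.
    latency_gap_ns: assumed extra latency per access in the slow tier."""
    pages = np.asarray(sampled_addrs) >> page_shift
    hits = np.isin(pages, np.fromiter(slow_pages, dtype=pages.dtype)).sum()
    # Each sample stands for roughly `sampling_period` real accesses.
    return baseline_s + hits * sampling_period * latency_gap_ns * 1e-9
```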
- Methods to enhance accuracy.
- Integration with machine learning.
- Evaluation on different platforms with heterogeneous memory devices.
I am employing modern game engines to facilitate high-quality visualizations of generated plant models. My work includes the generation of ground truth data for machine learning algorithms and the inspection of its use and training behavior. For this, I need high-quality plant simulations, parametrizations, and graphical features, as well as a robust pipeline for efficient network training. Currently, I am working on a distributed learning setup and the parameter spaces for my virtual scenes.
We need scene parametrizations in a multitude of ways to provide realism while enabling augmentation. We need to know precisely what to augment and what to ignore. There is a need for learning architectures that support large-image processing, as well as optimizations of data usage and computational complexity in an extremely heterogeneous pipeline.
The increasing need for real-time analytics motivated the emergence of new incremental methods to learn representations from continuous flows of data, especially in the context of the Internet of Things. This trend led to the evolution of centralized computing infrastructures towards interconnected processing units spanning from edge devices to cloud data centers. This new paradigm is referred to as the Computing or Edge-to-Cloud Continuum. However, the network and compute heterogeneity across and within clusters may negatively impact Deep Learning (DL) training. I would like to introduce a roadmap for understanding the end-to-end performance of DL workloads in such heterogeneous settings. The goal is to identify key parameters leading to stragglers and devise novel intra- and inter-cluster strategies to address them.
I would like to identify potential collaborations to work in a heterogeneous, computing continuum type context. I am focusing on deep learning training workloads.
PyDDA (Pythonic Direct Data Assimilation) is a 3DVAR framework that assimilates data from an arbitrary number of weather radars together with other spatial wind fields in order to retrieve high-resolution three-dimensional wind fields. PyDDA minimizes a cost function which is a weighted sum of individual constraints using SciPy's implementation of L-BFGS-B. This implementation takes advantage of as many cores as are present in a single machine, allowing for fast convergence. At Argonne we have ~27 TB of radar data from Darwin, Australia, ready for testing PyDDA at scale. Because our problem exposes large-scale data parallelism, it is highly suitable to be mapped to tensors in hardware.
We recently reimplemented PyDDA using JAX. We have also attempted to run PyDDA on a Groq TSP, a Graphcore IPU, and an NVIDIA GPU. In doing so, we attempted to reimplement PyDDA using PyTorch and TensorFlow, and we were successful in reimplementing and validating PyDDA using TensorFlow. We will explain the challenges in this effort and the further work that remains.
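The sketch below shows the general pattern of minimizing a weighted sum of 3DVAR-style cost terms with SciPy's L-BFGS-B while obtaining gradients from JAX; the cost terms, weights and field shapes are illustrative placeholders rather than PyDDA's actual constraints.

```python
# Hedged sketch: weighted-sum cost minimization with SciPy L-BFGS-B and JAX
# gradients. The data-fit and smoothness terms are generic stand-ins.
import numpy as np
import jax
import jax.numpy as jnp
from scipy.optimize import minimize

def total_cost(winds_flat, radar_obs, weights):
    winds = winds_flat.reshape(radar_obs.shape)
    j_obs = jnp.sum((winds - radar_obs) ** 2)           # fit to radar-derived winds
    j_smooth = jnp.sum(jnp.diff(winds, axis=-1) ** 2)   # penalize rough fields
    return weights[0] * j_obs + weights[1] * j_smooth

cost_grad = jax.jit(jax.value_and_grad(total_cost))

def retrieve_winds(first_guess, radar_obs, weights=(1.0, 0.1)):
    def fun(x):
        val, grad = cost_grad(jnp.asarray(x), radar_obs, jnp.asarray(weights))
        return float(val), np.asarray(grad, dtype=np.float64)
    res = minimize(fun, first_guess.ravel(), jac=True, method="L-BFGS-B")
    return res.x.reshape(radar_obs.shape)

# Example on placeholder data:
obs = np.random.default_rng(0).standard_normal((3, 40, 40))
winds = retrieve_winds(np.zeros_like(obs), jnp.asarray(obs))
```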
We are interested in porting all aspects of the code to the accelerators instead of the cost functions alone. We are also looking to port the code to other accelerators.
Funded by a recent ASCR project, we are developing a privacy-preserving federated learning framework that implements distributed learning algorithms for AI/ML models while keeping data private. We introduce the current status of the framework and discuss challenges and potential research directions.
The questions and collaboration opportunities include the integration of advanced data compression/encryption in federated learning, applications to edge devices, leveraging and integrating existing frameworks that allow for model training on HPC systems, and the federation of heterogeneous architectures.
In recent years Physics-Informed Neural Networks (PINNs), neural networks that are aware of the underlying physics of the problem they try to solve, have become a popular tool for the solution of Partial Differential Equations (PDEs). In this work we exploit ideas stemming from classical solution methods for PDEs to accelerate the convergence of the training phase of PINNs. In particular, drawing inspiration from multigrid methods, we consider two different training strategies, which both exploit the structure of the training problem to build a sequence of problems of decreasing dimension that approximate the original one. The knowledge of such a hierarchy is exploited to reduce the major cost per iteration of classical optimization methods, namely the step computation. The proposed methods thus allow for a faster training process that is less dependent on the choice of the learning rate and more effective for problems with different modes.
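The sketch below illustrates the general multilevel flavor of such training on a toy PINN for u''(x) = f(x) on [0, 1]: a cheap coarse level is trained first and the result is reused on a finer level. The two-level collocation hierarchy, network and optimizer are illustrative simplifications, not the strategies analyzed in this work.

```python
# Hedged sketch: coarse-to-fine training of a toy PINN for u''(x) = f(x) on
# [0, 1] with u(0) = u(1) = 0 and exact solution sin(pi x). Illustration only.
import jax
import jax.numpy as jnp

def init_params(key, width=16):
    k1, k2 = jax.random.split(key)
    return {"w1": jax.random.normal(k1, (1, width)) * 0.5, "b1": jnp.zeros(width),
            "w2": jax.random.normal(k2, (width, 1)) * 0.5, "b2": jnp.zeros(1)}

def net(params, x):
    h = jnp.tanh(x[:, None] @ params["w1"] + params["b1"])
    u = (h @ params["w2"] + params["b2"]).squeeze(-1)
    return x * (1.0 - x) * u                    # hard-code the boundary conditions

def residual_loss(params, xs):
    f = lambda x: net(params, jnp.atleast_1d(x))[0]
    u_xx = jax.vmap(jax.grad(jax.grad(f)))(xs)  # second derivative at each point
    rhs = -jnp.pi ** 2 * jnp.sin(jnp.pi * xs)
    return jnp.mean((u_xx - rhs) ** 2)

@jax.jit
def sgd_step(params, xs, lr=1e-2):
    grads = jax.grad(residual_loss)(params, xs)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = init_params(jax.random.PRNGKey(0))
for xs, steps in [(jnp.linspace(0.0, 1.0, 8), 2000),    # cheap coarse level
                  (jnp.linspace(0.0, 1.0, 64), 2000)]:  # expensive fine level
    for _ in range(steps):
        params = sgd_step(params, xs)
```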
The approach that we present has so far only been tested numerically; a meaningful research perspective is the development of its theoretical analysis. Another important direction that we would like to pursue is to test the method on real-world problems.
Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which are methods for solving linear systems via the factorization of matrices into products of triangular matrices, are commonly used in many contexts. The Cholesky factorization is the fastest direct method for symmetric positive definite matrices. In this talk, we present selective nesting, a method to determine the optimal task granularity for the parallel Cholesky factorization based on the structure of sparse matrices. We propose the OPT-D-COST algorithm, which automatically and dynamically applies selective nesting. OPT-D-COST is the first approach to leverage matrix sparsity to drive complex task-based parallel workloads in the context of direct solvers. We run an extensive evaluation campaign considering a heterogeneous set of 60 sparse matrices and a parallel machine featuring the A64FX processor. OPT-D-COST delivers an average performance speedup of 1.46× with respect to the best state-of-the-art parallel method for running direct solvers.
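As a caricature of the kind of decision selective nesting makes, the sketch below chooses, per supernode, whether to spawn nested fine-grained tasks based on a rough cost estimate derived from the sparsity structure; the cost model and threshold are placeholders and this is not the OPT-D-COST algorithm.

```python
# Hedged sketch: granularity selection for a task-based sparse Cholesky, driven
# by the structure found during symbolic factorization. Illustration only.
def choose_granularity(supernodes, nested_threshold_flops=5e7):
    """supernodes: iterable of (name, n_rows, n_cols) panel blocks.
    Returns a plan mapping each supernode to a tasking strategy."""
    plan = {}
    for name, n_rows, n_cols in supernodes:
        est_flops = n_rows * n_cols ** 2          # rough dense-panel cost model
        plan[name] = ("nested-tasks" if est_flops >= nested_threshold_flops
                      else "single-task")         # small panels: avoid task overhead
    return plan
```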
We plan to collaborate with other researchers to: 1) find new examples of applications where this technique could be applied; 2) find new ways to determine the optimal granularity for Cholesky based on the structure of the non-zeros, as several matrices still resist the proposed algorithm and we would like to investigate why they differ from the others; and 3) discuss strategies to further reduce the idle time of the different threads (hybrid algorithms?).
Dense systems of linear equations are usually solved using Gaussian elimination with partial pivoting. However, the growing gap between computational power and communication bandwidth makes pivoting increasingly costly, especially on distributed, heterogeneous systems. We present our work on using randomized transforms to solve the systems accurately without pivoting.
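A minimal sketch of the general idea is given below, with dense random orthogonal factors standing in for the structured randomized transforms actually used: the system is pre- and post-multiplied by random matrices so that a pivot-free LU factorization can be applied safely, and the solution is recovered afterwards.

```python
# Hedged sketch: solving A x = b without pivoting after random two-sided
# transforms A' = U A V. Random orthogonal factors are illustrative stand-ins
# for the structured (e.g. butterfly) transforms used in practice.
import numpy as np

def random_orthogonal(n, rng):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def lu_no_pivot(a):
    """Naive LU factorization without pivoting (illustration only)."""
    a = a.copy()
    n = a.shape[0]
    for k in range(n - 1):
        a[k + 1:, k] /= a[k, k]
        a[k + 1:, k + 1:] -= np.outer(a[k + 1:, k], a[k, k + 1:])
    return a  # L (unit lower) and U packed together

def solve_with_random_transform(a, b, seed=0):
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    u, v = random_orthogonal(n, rng), random_orthogonal(n, rng)
    lu = lu_no_pivot(u @ a @ v)
    # Triangular solves on the packed factors (general solve used for brevity).
    y = np.linalg.solve(np.tril(lu, -1) + np.eye(n), u @ b)
    z = np.linalg.solve(np.triu(lu), y)
    return v @ z

a = np.random.default_rng(1).standard_normal((50, 50))
b = np.ones(50)
print(np.linalg.norm(a @ solve_with_random_transform(a, b) - b))
```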
The approach has been analyzed on artificial test matrices; however, these matrices may or may not be representative of linear systems that arise in practice. So, it would be valuable to collaborate with real applications in order to understand how well the approach works on real linear systems.
An important way to improve the scalability of orthogonalizing a set of basis vectors is to reduce the communication required. We present a version of Classical Gram-Schmidt with reorthogonalization (CGS2) which reduces the synchronizations per iteration from three to one. The focus of this work is the context of a QR factorization and Krylov expansions, where the expansion is one column at a time and "left"-looking.
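For reference, the sketch below shows a plain (serial) CGS2 step that orthogonalizes one new column against an existing basis by projecting twice and normalizing. In a distributed setting each projection is a global reduction, and the low-synchronization variant presented here reorganizes those reductions so that only one synchronization per column remains, which the serial sketch does not show.

```python
# Hedged sketch: baseline CGS2 step appending one column to an orthonormal basis.
import numpy as np

def cgs2_append(q_basis, w):
    """Orthogonalize w against the columns of q_basis, twice, then normalize."""
    for _ in range(2):                       # "twice is enough" reorthogonalization
        coeffs = q_basis.T @ w               # projection (a global reduction in parallel)
        w = w - q_basis @ coeffs             # subtract the projections
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((1000, 5)))
new_col = cgs2_append(q, rng.standard_normal(1000))
print(np.abs(q.T @ new_col).max())           # ~machine epsilon
```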
An extension to the presented work is to consider having access to more than one column at a time and orthogonalizing a set of basis vectors by block. Additionally, delaying various operations to achieve fewer synchronizations reduces the number of passes over the data, which can lead to a better memory footprint. Can this be extended beyond an orthogonalization scheme to additional algorithms of interest?
In symmetric tridiagonal NxN matrices with very large N, we find K eigenvalues and eigenvectors in a thin "slice" of the eigenspectrum, so K << N. These must be orthogonalized before being returned to the user, and this O(N*K^2) orthogonalization (currently done with TSQR, tall-skinny QR) takes up to 95% of the runtime even in parallel code with 72 cores.
The eigenvalues of the symmetric tridiagonal matrix should be distinct and the vectors returned should be orthogonal. This is the case when the eigenvalues are "well separated". The non-orthogonality occurs in clusters where eigenvalues are very close together, either absolutely or relative to their magnitude. (This is not well characterized, just empirically noted.)
The current TSQR orthogonalizes every vector against every other, but there may be an opportunity for large performance improvements here through SELECTIVE orthogonalization, if we can use the vector of eigenvalues to characterize smaller clusters of eigenvectors that need to be orthogonalized.
It is an open question whether this kind of partitioning is possible. Empirically, in our test matrices, we have seen some eigenvector sets that need no orthogonalization and others with just one cluster of about (K/10) vectors that needed it; O(N*(K/10)^2) would be a 100-fold speedup. The Wilkinson matrix seems to need only pairwise orthogonalization, an O(N*K) operation instead of O(N*K^2). Another collaboration opportunity would be providing real-world test matrices where spectrum slicing applies; we know some applications exist in quantum physics.
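One possible shape for such selective orthogonalization is sketched below: eigenvalues are grouped into clusters wherever the relative gap to the next eigenvalue is large, and a QR factorization (standing in for TSQR) is applied only within each cluster. The gap criterion is an assumption for illustration, not a validated threshold.

```python
# Hedged sketch: selective orthogonalization driven by eigenvalue clustering.
import numpy as np

def cluster_by_gap(eigvals, rel_gap=1e-3):
    """Return index ranges of clusters of nearby eigenvalues (assumed sorted)."""
    clusters, start = [], 0
    for i in range(1, len(eigvals)):
        gap = eigvals[i] - eigvals[i - 1]
        scale = max(abs(eigvals[i]), abs(eigvals[i - 1]), 1e-30)
        if gap > rel_gap * scale:            # well separated: close the cluster
            clusters.append((start, i))
            start = i
    clusters.append((start, len(eigvals)))
    return clusters

def selective_orthogonalize(eigvals, vectors, rel_gap=1e-3):
    """Orthogonalize the N x K eigenvector block cluster by cluster."""
    for lo, hi in cluster_by_gap(eigvals, rel_gap):
        if hi - lo > 1:                                  # singletons need no work
            q, _ = np.linalg.qr(vectors[:, lo:hi])       # stand-in for TSQR
            vectors[:, lo:hi] = q
    return vectors
```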
Geostatistics predicts desired quantities from geographically distributed data, based on statistical models and optimization of parameters. A primary computational kernel is the evaluation of the maximum log-likelihood estimation (MLE) function, whose central data structure is a dense, symmetric, positive definite covariance matrix and whose two essential operations are the application of the inverse and the evaluation of the determinant. Both operations can be rendered through the Cholesky decomposition and triangular solves. The key observation is that environmental characteristics typically exhibit loss of correlation with distance, which motivates our study of mixed-precision factorization.
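For concreteness, the sketch below evaluates the Gaussian log-likelihood through a Cholesky factorization, which provides both the inverse application (via triangular solves) and the log-determinant (from the diagonal of the factor); the exponential covariance model and its parameters are generic placeholders, not the ExaGeoStat kernels.

```python
# Hedged sketch: Gaussian log-likelihood via Cholesky factorization of the
# covariance matrix. Covariance model and parameters are illustrative only.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def exp_covariance(coords, variance, length_scale):
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return variance * np.exp(-d / length_scale)

def gaussian_loglik(z, coords, variance, length_scale, nugget=1e-8):
    n = len(z)
    sigma = exp_covariance(coords, variance, length_scale) + nugget * np.eye(n)
    c, lower = cho_factor(sigma, lower=True)             # Sigma = L L^T
    quad = z @ cho_solve((c, lower), z)                  # z^T Sigma^{-1} z
    logdet = 2.0 * np.sum(np.log(np.diag(c)))            # log det Sigma
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))
```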
We incorporate the PaRSEC dynamic runtime system into the ExaGeoStat framework. By utilizing algorithmic, architectural, and programming model features of PaRSEC, ExaGeoStat-PaRSEC is able to employ numerical approximations by means of mixed-precision computations to reduce the arithmetic complexity and memory footprint. We validated the numerical precision of our proposed method with synthetic datasets and a real-world test case, and we highlight significant performance improvements on large-scale multi-GPU systems (Summit). To our knowledge, this is the first time a three-precision Cholesky solver has been deployed on large-scale GPU-based systems. This may permit us to achieve exascale geospatial applications, while providing new statistical insights at unprecedented scales.
This is joint work with Jan Hueckelheim, ANL. Our goal is to apply Algorithmic Differentiation, in its adjoint (aka reverse) mode, to OpenMP source code. We describe transformation rules, which we justify through the PGAS (Partitioned Global Address Space) formalism, then focus on the main issue of the race conditions introduced by adjoint AD. We investigate several approaches to address this issue and measure their performance on classical benchmarks.
- Issue: efficient implementation of atomic operations, especially of the "+=" kind.
- Issue: reusing the data-dependency analysis of the original code for its adjoint differentiated code.
Today's scientific high performance computing (HPC) applications and advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce data volumes, but it can also strictly control the data distortion based on the user-specified error bound. Existing lossy compressors, however, cannot offer ultra-fast compression speed, which is highly demanded by quite a few applications or use cases (such as in-memory compression and online instrument data compression). In this talk, we propose a novel ultra-fast error-bounded lossy compressor, which can obtain fairly high compression performance on both CPU and GPU, also with reasonably high compression ratios. The key contributions are three-fold: (1) We propose a novel, generic ultra-fast error-bounded lossy compression framework -- called SZx -- by confining our design to be composed of only super-lightweight operations such as bitwise and addition/subtraction operations, while still keeping a fairly high compression ratio. (2) We implement SZx on both CPU and GPU and carefully optimize its performance according to their architectures. (3) We perform a comprehensive evaluation with 10 real-world production-level scientific datasets on both CPU and GPU. Experiments show that SZx is 2~16X as fast as the second-fastest existing error-bounded lossy compressor (either SZ or ZFP) on CPU and GPU, with respect to both compression and decompression.
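To illustrate the spirit of such super-lightweight designs (this is not SZx's actual algorithm), the toy compressor below splits the data into small blocks and stores a single value for any block whose entries all lie within the error bound of the block mean, falling back to raw storage otherwise.

```python
# Hedged sketch: toy error-bounded, block-constant compressor built from
# lightweight operations only. Illustration of the idea, not SZx.
import numpy as np

def compress(data, error_bound, block=128):
    blocks = []
    for i in range(0, len(data), block):
        chunk = data[i:i + block]
        mean = chunk.mean()
        if np.all(np.abs(chunk - mean) <= error_bound):
            blocks.append(("const", len(chunk), float(mean)))   # one value per block
        else:
            blocks.append(("raw", len(chunk), chunk.copy()))    # fallback: store as-is
    return blocks

def decompress(blocks):
    out = [np.full(n, payload) if kind == "const" else payload
           for kind, n, payload in blocks]
    return np.concatenate(out)

data = np.sin(np.linspace(0, 20, 1_000_000)).astype(np.float32)
restored = decompress(compress(data, error_bound=1e-3))
assert np.max(np.abs(restored - data)) <= 1e-3                  # error bound respected
```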
Why is lossy compression speed very important? How to leverage ultra-fast lossy compression in practice?
On large clusters, communication is increasingly becoming the bottleneck. With the upcoming exascale machines, the gap between computation and network bandwidth is even larger than before.
In this talk, we consider different compression techniques, such as casting operations, and show how their integration into the communication phase can be beneficial to an application. Moreover, in the case of mixed precision, where a loss of accuracy is acceptable, this approach improves the execution time and is also able to reach better accuracy. We illustrate this in the context of a 3D FFT code, where our approach gives one order of magnitude better accuracy than a full execution in lower precision.
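The sketch below illustrates the simplest such technique, compression by casting, in an mpi4py exchange: values kept in double precision locally are down-cast to single precision only for the transfer, halving the traffic, and promoted back on the receiving side. The ring-style exchange and message size are placeholder assumptions.

```python
# Hedged sketch: "compression by casting" applied to a ring exchange with
# mpi4py. Illustration of the general idea only.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.random.default_rng(rank).standard_normal(1_000_000)   # float64 locally

# Down-cast to float32 for the wire, exchange with the neighboring ranks.
send_buf = local.astype(np.float32)
recv_buf = np.empty_like(send_buf)
comm.Sendrecv(send_buf, dest=(rank + 1) % size,
              recvbuf=recv_buf, source=(rank - 1) % size)

received = recv_buf.astype(np.float64)    # promote back before further compute
```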
We want to understand how this approach can be integrated into existing codes.
Lossy compression plays a growing role in scientific simulations, where the cost of storing their output data can span terabytes. Using error-bounded lossy compression reduces the amount of storage for each simulation; however, there is no known bound for the upper limit on lossy compressibility. Correlation structures in the data, the choice of compressor, and the error bound are factors allowing larger compression ratios and improved quality metrics. Analyzing these three factors provides one direction towards quantifying the limits of lossy compressibility. As a first step, we explore statistical methods to characterize the correlation structures present in the data and their relationships, through functional regression models, to compression ratios. In particular, the compression ratios of the widely used lossy compressors for scientific data SZ, ZFP and MGARD exhibit a logarithmic dependence on the global and local correlation ranges when combined with information on the variability of the considered fields through the variance or gradient magnitude. Further work will focus on providing a unified characterization of these relationships across compressors and error bounds. This constitutes a first step towards evaluating the theoretical limits of lossy compressibility, used to eventually predict compression performance and adapt compressors to the correlation structures present in the data.
Existence and characterization of theoretical limits of lossy compressibility
Recent advances in Deep Neural Networks (DNNs) have demonstrated promising potential in predicting the temporal and spatial proximity of time-evolutionary data. In this short talk, we introduce an effective (de)compression framework called TEZip that supports dynamic lossy and lossless compression of time-evolutionary image frames with high compression ratios and speed.
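The sketch below conveys the prediction-based idea in its simplest form (compressor side only): a predictor, here just the previous frame standing in for TEZip's learned DNN predictor, estimates each frame, and only the quantized, losslessly packed residual is stored.

```python
# Hedged sketch: prediction-based compression of a frame sequence. The
# previous-frame predictor is a stand-in for a learned DNN predictor; the
# symmetric decompressor (not shown) replays the same predictions.
import numpy as np
import zlib

def compress_sequence(frames, error_bound):
    packed, prev = [frames[0].tobytes()], frames[0]      # first frame stored raw
    for frame in frames[1:]:
        residual = np.round((frame - prev) / (2 * error_bound)).astype(np.int32)
        packed.append(zlib.compress(residual.tobytes())) # residuals compress well
        prev = prev + residual * (2 * error_bound)       # mirror the decoder state
    return packed
```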
Can we extend TEZip with more domain-specific predictors?
This presentation will summarize recent improvements in capabilities for generic lossy compression using LibPressio, powered by compressors such as SZ, ZFP, MGARD, and others, including improvements to black-box compression configuration techniques and parallel compression.
Open Questions/Collaboration Opportunities:
1. What are the remaining gaps in the capabilities of compression interfaces for applications? (i.e., better support for asynchrony, accelerators, streaming compression, point-cloud and/or sparse compression, etc.) Ideal collaborators have applications that are considering/using compression but would like to consider ways to improve performance or evaluate different compression methodologies.
2. How can we further drive down the costs of black-box compression techniques to enable their use by a wider community? (i.e., faster search techniques that use fewer invocations of the compressors, proxy-model based approaches, techniques that use fewer tunings, etc.) Ideal collaborators have experience in numerical optimization/black-box optimization techniques, transfer-learning-inspired optimization approaches, or a related subject, and are interested in applying this to the interdisciplinary problem of bounding user metrics.
The organizers of the 13th JLESC Workshop are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, religion (or lack thereof), technology choices, or other group status.
To make clear what is expected, everyone taking part in the event - speakers, helpers, organizers, and participants - is required to conform to the Berlin Code of Conduct. The full text of the Code of Conduct can be found at http://berlincodeofconduct.org/.
To give a brief overview here, you are expected to:
The following behavior is unacceptable: intimidating, harassing, abusive, discriminatory, derogatory or demeaning speech or actions by any participant in our community online, at all related events and in one-on-one communications carried out in the context of community business.
Harassment includes harmful or prejudicial verbal or written comments related to gender, sexual orientation, race, religion, disability; inappropriate use of nudity and/or sexual images (including presentation slides); inappropriate depictions of violence (including presentation slides); deliberate intimidation, stalking or following; harassing photography or recording; sustained disruption of talks or other events.
If you witness or are subject to unacceptable behavior, please contact one of the workshop organizers via email or Slack. You can do so anonymously.