The workshop gathers leading researchers in high-performance computing from the JLESC partners INRIA, the University of Illinois, Argonne National Laboratory, Barcelona Supercomputing Center, Jülich Supercomputing Centre, RIKEN R-CCS and The University of Tennessee to explore the most recent and critical issues in advancing the field of HPC from petascale to the extreme scale era.
The workshop will feature sessions on eight central topics.
In addition to these tracks, dedicated sessions targeting more specialized scientific domains are planned. The target domains change for each meeting depending on the needs and interests of the JLESC community. For this meeting the target domains are computational fluid dynamics, computational biology, and climate/weather research.
A key objective of the workshop is to identify new research collaborations and establish a roadmap for their implementation.
Most of the workshop is open to all participants from the JLESC institutions Illinois, Inria, ANL, BSC, JSC, RIKEN R-CCS and UTK: faculty, researchers, engineers and students who want to learn more about post-petascale / pre-exascale computing. In addition to the sessions with restricted participation, the 12th JLESC meeting features an Open Day on the last day of the workshop, February 26th, 2021, where attendance is open to anybody interested in any of the workshop's topics.
The Open Day features three invited talks from leaders in the field, presented alongside three success stories from JLESC teams. The invited speakers are Prof. Satoshi Matsuoka (RIKEN), Dr. Lois Curfman McInnes (ANL) and Prof. Torsten Hoefler (ETH Zurich).
https://www.virtualchair.net/events/jlesc
https://zoom.virtualchair.net/jlesc/Audience/379aMA
Time | Track 1 (Location: Room 1) | BOS (Location: Plenary)
---|---|---
08:00 ET | Opening Remarks (Location: Plenary) |
08:15 ET | ST M1.1 Advanced Computing (Session chair: Philippe Swartvagher) | Python in Parallel Computing
10:00 ET | Break (Activities on zoom and gather.town) |
10:15 ET | Panel: Open challenges in scheduling for parallel computing (Location: Plenary). Moderator: Yves Robert, Inria. Panelists: Rosa M. Badia (BSC), George Bosilca (UTK), Arnaud Legrand (Inria), Swann Perarnau (ANL), Marc Snir (UIUC), Miwako Tsuji (RIKEN) |
11:15 ET | Meeting venues (zoom session and gather.town) will remain open until 1 PM ET |
Time | Track 1 (Location: Room 1) | Track 2 (Location: Room 2) | BOS (Location: Plenary)
---|---|---|---
08:00 ET | ST M2.1 (6) AI and Applications (Session chair: Daniel Barry) | ST M2.2 (6) I/O (Session chair: Daichi Mukunoki) | ARM
09:30 ET | Break (Activities on zoom and gather.town) | |
09:45 ET | ST M2.3 (6) Performance tools and numerical methods (Session chair: Kevin Sala) | ST M2.4 (6) Programming languages and runtimes (Session chair: Ruth Schöbel) | Heterogeneous and reconfigurable architectures for the future of computing
11:15 ET | Closing Remarks (Location: Plenary) | |
11:20 ET | Meeting venues (zoom session and gather.town) will remain open until 1 PM ET | |
Track 1 (Location: Plenary)

Time | Session
---|---
08:00 ET | Opening remarks
08:15 ET | Prof. Satoshi Matsuoka, Fugaku: the first 'Exascale' supercomputer
08:45 ET | Dr. Gabriel Antoniu, A Story About Data: Advancing Storage, I/O and Processing at Challenging Scales
09:15 ET | Dr. Lois Curfman McInnes, How a Community Software Ecosystem Perspective Helps to Advance Science Goals in the Exascale Computing Project
09:45 ET | Break (BYOC: aka Bring Your Own Coffee)
10:00 ET | Dr. Leo Bautista, Resilience for Extreme Scale Computing
10:30 ET | Prof. Torsten Hoefler, High-Performance Deep Learning
11:00 ET | Dr. Brian Wylie, Developer tools for porting & tuning parallel applications on extreme-scale systems
11:30 ET | Closing remarks
Title | Presenter | |
---|---|---|
Fugaku: the first 'Exascale' supercomputer | Prof. Satoshi Matsuoka Riken CCS | |
Abstract: Fugaku is the first 'exascale' supercomputer of the world, not due to its peak double-precision flops, but rather its demonstrated performance in real applications that were expected of exascale machines at their conception 10 years ago, as well as reaching actual exaflops in a new breed of benchmarks such as HPL-AI. But the real importance of Fugaku is the "applications first" philosophy under which it was developed, and its resulting mission to be the centerpiece for rapid realization of the so-called Japanese 'Society 5.0' as defined by the Japanese S&T national policy. As such, Fugaku's immense power is directly applicable not only to traditional scientific simulation applications, but can also target Society 5.0 applications that encompass the convergence of HPC, AI and Big Data as well as of Cyber (IDC & Network) and Physical (IoT) space, with immediate societal impact as its technologies are made available as Cloud resources. In fact, Fugaku is already in partial operation a year ahead of schedule, primarily to obtain early Society 5.0 results, including combatting COVID-19 and resolving other important societal issues, and will go into full production shortly. | ||
How a Community Software Ecosystem Perspective Helps to Advance Science Goals in the Exascale Computing Project | Dr. Lois Curfman McInnes Argonne National Laboratory | |
Abstract: Teams in the U.S. Exascale Computing Project (ECP) are working toward scientific advances on forthcoming exascale platforms, across a diverse suite of applications in chemistry, materials, energy, Earth and space science, data analytics, optimization, artificial intelligence, and national security. In turn, these applications build on software components, including programming models and runtimes, mathematical libraries, data and visualization packages, and development tools that comprise the Extreme-scale Scientific Software Stack (E4S). E4S represents a portfolio-driven effort to collect, test, and deliver the latest in reusable open-source HPC software products, as driven by the common needs of applications. E4S establishes product quality expectations and provides a portal as a starting point for access to product documentation. This presentation will discuss early experiences with how this software ecosystem approach delivers the latest advances from ECP software technology projects to applications, thereby helping to overcome software collaboration challenges across distributed aggregate teams. A key lesson learned is the need for close collaboration between teams developing applications and reusable software technologies, as well as the need for crosscutting strategies to increase developer productivity and software sustainability, thereby mitigating technical risks by building a firmer foundation for reproducible, sustainable science. | ||
High-Performance Deep Learning | Prof. Torsten Hoefler ETH Zurich | |
Abstract: Deep Learning is as computationally expensive as the most challenging scientific computing applications. In this talk, we outline the biggest challenges in training deep learning workloads and show how HPC techniques can be used to improve the performance of training workloads. We focus on model sparsity in the training process. This will be even more important once the scientific computing community uses deep learning in their workflows. | ||
A Story About Data: Advancing Storage, I/O and Processing at Challenging Scales | Dr. Gabriel Antoniu INRIA | |
Abstract: Looking back over more than 10 years of collaboration within JLESC involving Inria, the University of Illinois at Urbana-Champaign and Argonne National Laboratory, this talk will highlight a few achievements on hot topics related to data storage, I/O management, and in situ visualisation and processing. From the initial challenges in these areas posed by the expected arrival of Exascale systems, new ones emerged as the frontiers between High-Performance Computing and Big Data analytics started to blur. We will also discuss upcoming open problems triggered by increasingly complex workflows mixing simulations, analytics and AI, which emphasize new requirements and opportunities created by their potential execution on the HPC/Cloud/Edge computing continuum. | ||
Resilience for Extreme Scale Computing | Dr. Leo Bautista BSC | |
Abstract: Resilience has been one of the main research topics of the JLESC since its conception over a decade ago. We have covered multiple types of failures and errors, which has led to completely different fault-tolerance techniques, some of them at the intersection of HPC and ML. The research work, carried out by JLESC researchers from five different institutions, shows a strong interaction between theoretical analysis and practical implementations. The results of this endeavor have led to multiple collaboration visits, dozens of publications and hundreds of citations; but more interestingly, it has opened new questions and revealed connections between HPC fields that we did not know were connected before. In this talk we will go over this trajectory and get a quick glance at what might come in the future for HPC resilience. | ||
Developer tools for porting & tuning parallel applications on extreme-scale systems | Dr. Brian Wylie JSC | |
Abstract: Application developers targeting extreme-scale HPC systems such as Fugaku, and modular supercomputing architectures such as JUWELS, need effective tools to assist with porting and tuning for these unusual systems. This collaborative project brings together developers of such tools from JLESC partners to investigate their integration and support joint training activities as the tools are deployed and applied to a variety of application codes. |
Satoshi Matsuoka became the director of RIKEN R-CCS in April 2018. R-CCS is the top-tier HPC center that represents HPC in Japan, developing and hosting Japan's tier-one 'Fugaku' supercomputer, which has become the fastest supercomputer in the world in all four major supercomputer rankings, alongside a broad portfolio of ongoing cutting-edge HPC research, including investigations into Post-Moore era computing.
He had been a Full Professor at the Global Scientific Information and Computing Center (GSIC), the Tokyo Institute of Technology since 2000, and the director of the joint AIST-Tokyo Tech. Real World Big Data Computing Open Innovation Laboratory (RWBC-OIL) since 2017, and became a Specially Appointed Professor at Tokyo Tech in 2018 along with his directorship at R-CCS.
He has been the leader of the TSUBAME series of supercomputers that have won many accolades such as world #1 in power-efficient computing. He also leads various major supercomputing research projects in areas such as parallel algorithms and programming, resilience, green computing, and convergence of big data/AI with HPC.
He has written over 500 articles according to Google Scholar, and chaired numerous ACM/IEEE conferences, including the Program Chair at the ACM/IEEE Supercomputing Conference (SC13) in 2013. He is a Fellow of the ACM and European ISC, and has won many awards, including the JSPS Prize from the Japan Society for Promotion of Science in 2006, presented by his Highness Prince Akishino; the ACM Gordon Bell Prize in 2011; the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology in 2012; the 2014 IEEE-CS Sidney Fernbach Memorial Award, the highest prestige in the field of HPC; HPDC 2018 Achievement Award from ACM; and recently SC Asia 2019 HPC Leadership Award.
Lois Curfman McInnes is a senior computational scientist in the Mathematics and Computer Science Division of Argonne National Laboratory. Her work focuses on high-performance computational science and engineering, with emphasis on scalable numerical libraries and community collaboration toward productive and sustainable software ecosystems. Lois has developed numerical algorithms and software for the parallel solution of large-scale scientific applications involving nonlinear partial differential equations and related optimization problems in the PETSc/TAO libraries. In February 2020, she began serving as Deputy Director of Software Technology in the US DOE Exascale Computing Project. Lois also co-leads the IDEAS project, whose members are partnering with the community to improve software productivity and sustainability as a key aspect of advancing overall scientific productivity. Previously, she served as the technical area lead of Mathematical Libraries in ECP’s Software Technology thrust and lead of the xSDK project—community collaboration to advance code quality, access, and interoperability while working toward a software ecosystem for high-performance numerical libraries. Lois is a SIAM Fellow. She won the 2015 SIAM/ACM Prize in Computational Science and Engineering and received an R&D 100 Award in 2009 (with collaborators) for work on the PETSc library; she also won an E.O. Lawrence Award in 2011 for outstanding contributions in research and development supporting DOE and its missions. She received a PhD in applied mathematics from the University of Virginia and a BS in mathematics and physics from Muhlenberg College. Lois served as Chair (2015–16) and Program Director (2013–14) of the SIAM Activity Group on Computational Science and Engineering, and she currently serves on the SIAM Council and the editorial board of SIAM News. More information: https://www.mcs.anl.gov/~mcinnes.
Torsten Hoefler directs the Scalable Parallel Computing Laboratory (SPCL) at D-INFK ETH Zurich. He received his PhD degree in 2007 at Indiana University and started his first professor appointment in 2011 at the University of Illinois at Urbana-Champaign.
Torsten has served as the lead for performance modeling and analysis in the US NSF Blue Waters project at NCSA/UIUC. Since 2013, he has been a professor of computer science at ETH Zurich and has held visiting positions at Argonne National Laboratory, Sandia National Laboratories, and Microsoft Research Redmond (Station Q).
Dr. Hoefler's research aims at understanding the performance of parallel computing systems ranging from parallel computer architecture through parallel programming to parallel algorithms. He is also active in the application areas of Weather and Climate simulations as well as Machine Learning with a focus on Distributed Deep Learning. In those areas, he has coordinated tens of funded projects and an ERC Starting Grant on Data-Centric Parallel Programming.
He has been chair of the Hot Interconnects conference and technical program chair of the Supercomputing and ACM PASC conferences. He is an associate editor of the IEEE Transactions on Parallel and Distributed Systems (TPDS) and the Parallel Computing journal (PARCO) and a key member of the Message Passing Interface (MPI) Forum.
He has published more than 200 papers in peer-reviewed international conferences and journals and co-authored the latest versions of the MPI specification. He has received best paper awards at the ACM/IEEE Supercomputing Conference in 2010, 2013, and 2014 (SC10, SC13, SC14), EuroMPI 2013, IPDPS'15, ACM HPDC'15 and HPDC'16, ACM OOPSLA'16, and other conferences. Torsten received ETH Zurich's Latsis Prize in 2015, the SIAM SIAG/Supercomputing Junior Scientist Prize in 2012, the IEEE TCSC Young Achievers in Scalable Computing Award in 2013, the Young Alumni Award 2014 from Indiana University, and the best student award 2005 of the Chemnitz University of Technology. Torsten was elected into the first steering committee of ACM's SIGHPC in 2013 and he was re-elected in 2016.
Gabriel Antoniu is a Senior Research Scientist at Inria, Rennes, where he leads the KerData research team. His recent research interests include scalable storage, I/O and in situ visualization, data processing architectures favoring the convergence of HPC, Big Data analytics and AI. He has served as a PI for several international projects in these areas in partnership with Microsoft Research, IBM, ATOS/BULL, Argonne National Lab, the University of Illinois at Urbana Champaign, Universidad Politécnica de Madrid, Barcelona Supercomputing Center. He served as Program Chair for the IEEE Cluster conference in 2014 and 2017 and regularly serves as a PC member of major conferences in the area of HPC, cloud computing and Big Data analytics (SC, HPDC, CCGRID, Cluster, Big Data, etc.). He co-authored over 150 international publications in the aforementioned areas. More information: https://team.inria.fr/kerdata/gabriel-antoniu/.
Dr. Leonardo Bautista-Gomez is an expert in reliability for high-performance computing (HPC). He is a Senior Research Scientist at the Barcelona Supercomputing Center (BSC) where he works in multiple H2020 European projects related to resilience, energy efficiency and multilevel storage systems for HPC. He is currently involved in multiple European projects, leading a team of Ph.D. students and engineers on the development of resilience techniques for large scale energy-efficient heterogeneous systems and machine learning workflows.
In 2016 he was awarded with a European Marie-Sklodowska Curie Actions fellowship on Deep-memory Ubiquity, Resilience and Optimization. In addition, he was awarded the 2016 IEEE TCSC Award for Excellence in Scalable Computing, Early Career Researcher. Before moving to BSC he was a Postdoctoral researcher at the Argonne National Laboratory, where he investigated data corruption detection techniques and error propagation. Most of this research was focused on how errors affect scientific applications and how to detect silent corruption using machine learning tools.
Leonardo did his PhD in resilience for supercomputers at the Tokyo Institute of Technology. He developed a scalable multilevel checkpointing library called the Fault Tolerance Interface (FTI) to guarantee application resilience at extreme scale. For this work, he received an Honorable Mention for the 2011 ACM/IEEE George Michael Memorial High Performance Computing PhD Fellowship at the Supercomputing Conference 2011 (SC11). Furthermore, he received a Special Certificate of Recognition for achieving a perfect score at SC11 for the paper "FTI: High-Performance Fault Tolerance Interface for Hybrid Systems". Moreover, he was awarded a Japan Society for the Promotion of Science (JSPS) Research Fellowship for Young Scientists (Doctoral Course).
Before moving to Tokyo Tech, he earned a Master's degree in Distributed Systems and Applications from Pierre and Marie Curie University (Paris 6), where he had previously obtained a Bachelor's degree in Computer Science.
Brian J. N. Wylie has been a scientific researcher in Juelich Supercomputing Centre since 2004, in the group developing the Scalasca toolset for scalable performance analysis of large-scale parallel applications. He established and continues to contribute to tools training activities of the Virtual Institute -- High Productivity Supercomputing (VI-HPS). His current focus is the assessment of exascale readiness of applications comprising very large numbers of processes and threads within the Performance Optimisation and Productivity (POP) EU Centre of Excellence in HPC.
This work will present a preliminary design and results of a Vivado HLS-implemented lossy compressor for Xilinx FPGAs, based on an updated SZ algorithm.
Looking for new use scenarios for FPGA/ASIC-based SZ lossy compression
This work discusses use cases of state preservation, which generalizes checkpointing into productive use cases (in addition to resilience). It focuses in particular on data management for deep learning applications, for which the state is either pre-processed training samples or partially trained models.
How can deep learning applications use model checkpoints and the dependencies between them to build better training strategies? How can caching of training data reduce sampling overhead? How can replay support be added to streams of training data to avoid catastrophic forgetting?
Non-deterministic communication is an important part of ensuring swift parallel execution of MPI applications, but can result in unpredictable, sometimes problematic, application behaviors. Specifically, non-determinism in executions can cause non-reproducible bugs and numerical non-reproducibility. Gathering adequate data to characterize non-deterministic behaviors in a given application is often costly at a minimum, and seemingly impossible at worst. While tools exist to assist in these efforts, there are limitations. For example, some tools focus entirely upon catastrophic bugs caused by non-determinism, while others, especially record-and-replay tools, enforce a particular execution pattern upon an application to guarantee reproducibility of a result. There is inadequate understanding of root sources of non-determinism for broader classes of bugs and scientific applications.
We present a software framework for identifying root sources of non-deterministic behavior in MPI applications and quantifying the application’s degree of non-determinism at different scales through the lens of graph similarity. Specifically, we devise a novel extension of event graph modeling in which we transform graph kernels to become a quantitative proxy for communication non-determinism of MPI applications. Our framework supports codes integrating four diverse MPI communication patterns: the simple but ubiquitous message race, the receiver-side and sender-side non-deterministic patterns in the Algebraic Multigrid 2013 Benchmark (AMG2013) [1], the non-blocking MPI 2-dimensional grid communication pattern (MCB Grid) in the Monte Carlo Benchmark [2], and the Unstructured Mesh pattern [3] with its randomized process topology. We show results for a message race benchmark at different scales. Our workflow has the potential to reduce costs of debugging and justify the soundness of MPI application results, even in the presence of numerical irreproducibility.
[1] J. Park, M. Smelyanskiy, U. M. Yang, D. Mudigere, and P. Dubey, 'High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems,' in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12.
[2] N. Gentile and B. Miller, 'Monte carlo benchmark (MCB),' https://computing.llnl.gov/projects/co-design/mcb, LLNL-CODE-507091.
[3] N. Jain and A. Bhatele, 'Chatterbug communication proxy applications suite,' https://github.com/LLNL/chatterbug, LLNL-CODE-756471.
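To make the event-graph idea concrete, the toy sketch below builds two small per-rank event graphs whose message edges differ because a race resolved differently, and compares them with NetworkX's graph edit distance. This is only an illustrative stand-in: the graph construction and the edit-distance metric are our simplifications, whereas the framework described above uses graph kernels, which scale far better.

```python
import networkx as nx

def event_graph(message_matches):
    """Toy event graph: a chain of events per rank plus one edge per matched message."""
    g = nx.DiGraph()
    for rank in range(2):                                # two ranks, three events each
        nx.add_path(g, [f"r{rank}e{i}" for i in range(3)])
    for i, (src, dst) in enumerate(message_matches):     # message matched at event slot i
        g.add_edge(f"r{src}e{i}", f"r{dst}e{i}")
    return g

# Two executions of the same code in which a message race resolves differently.
run_a = event_graph([(0, 1), (0, 1), (1, 0)])
run_b = event_graph([(0, 1), (1, 0), (0, 1)])

# Edit distance as a crude (and expensive) structural-similarity proxy;
# a distance of 0 would mean the two executions matched messages the same way.
print(nx.graph_edit_distance(run_a, run_b))
```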
We are looking for collaborators who have MPI applications (with point-to-point communications) and non-deterministic behaviour to serve as testing use cases for our framework. We have used the framework for miniAMR and MCB runs.
We are interested in collaborating with colleagues who are using graph theory or other methods to identify aspects associated with non-determinism such as bugs, differences in results, or other reasons.
We are interested in the state of MPI noise injection moving into QMPI. In the past, various noise injection techniques have existed, such as NINJA and Jitterbug. We want to use noise injection to study large scale communication nondeterminism with noise-injected smaller scale tests as a proxy. Tools that are built with QMPI in mind are needed to support tool stacks in future MPI implementations.
Microsoft, Google and other large cloud companies have been developing their own custom chips in-house. This custom hardware movement is driven by the end of Moore's law. That said, new custom hardware development still requires serious effort. The good news is that the advent of the open-source hardware movement and an open-source-friendly semiconductor industry seems to be gradually changing the perception of hardware development, and may eventually create a true co-design playground where software and hardware experts work together. In the meanwhile, I had an opportunity to be involved in Argonne's next-generation X-ray detector ASIC project. I will give a brief summary of the custom hardware movement, our recently started X-ray detector ASIC project and compressor designs, as well as a new hardware design methodology using Chisel, a Scala-based hardware construction language.
One question is how hardware customization can help our work. Another question is how we can leverage the open-source hardware movement and ecosystem. FPGAs allow us to create custom hardware in exchange for the overhead of reconfigurability, and they have demonstrated their performance on certain applications. Application-specific integrated circuits (ASICs) offer significantly better performance than FPGAs for similar logic, but the development cost of a custom ASIC may still be prohibitive. A question is what will happen if the cost of ASIC development becomes cheap enough. A technology like structured ASICs (write-once FPGAs) offers higher performance than FPGAs at lower cost than ASICs. The technologies are available; a question is how we can translate our software algorithms into hardware.
In terms of opportunities, studying hardware algorithms and translation mechanisms for our workloads is the most straightforward topic. Another topic is pre-processing of data on its way from physics experiments to HPC systems, where data movement is the biggest issue. How can we manage data in a holistic view (real-time lossless compression at the edge, lossy compression on the HPC side, encryption, etc.)? There could be opportunities for real-time AI inference in or near the detector. Lastly, studying hardware tools (e.g., place and route) and design tools such as Chisel themselves is also an interesting topic (dataflow, functional programming, scalability, optimization algorithms, etc.).
New research ideas require an instrument where they can be developed, tested, and shared. To support Computer Science research, such an instrument has to provide access to a diversity of hardware configurations, support deployment at scale, and offer deep reconfigurability so that a wide range of experiments can be supported. It also has to provide mechanisms for sharing so that new results can multiply their value by triggering further innovation. Most importantly, since science does not stand still, such an instrument requires constant adaptation to support an ever-increasing range of experiments driven by emergent ideas and opportunities.
The NSF-funded, deeply reconfigurable Chameleon testbed for Computer Science research and education (www.chameleoncloud.org) has been developed to provide all those capabilities. The testbed provides many thousands of cores and over 5 PB of storage hosted at sites at the University of Chicago and the Texas Advanced Computing Center (TACC), connected by a 100 Gbps network. The hardware consists of large homogeneous partitions to facilitate experiments at scale, alongside an investment in diversity consisting of a range of accelerators, storage hierarchy nodes with a mix of HDDs, SSDs, NVMe, and large RAM, high-bandwidth I/O storage, SDN-enabled networking hardware, and fast interconnects. To support Computer Science experiments ranging from operating systems to virtualization research, Chameleon provides a configuration system giving users full control of the software stack: provisioning of bare metal, reboot, power on/off and console access. To date, the testbed has supported over 5,000 users and almost 700 research and education projects, and has just been renewed for four more years of operation.
This talk will provide an update on the design strategy, and the existing and future capabilities of the testbed, as well as some of the research and education projects our users are working on. I will also introduce the services and tools we created to support sharing of experiments, curricula, and other digitally expressed artifacts that allow science to be shared via active involvement and foster reproducibility. Finally, I will describe how Chameleon is adapting to support new research directions and hope to discuss how we can most effectively move to support future research together.
Some open questions: what capabilities/hardware/scientific instrument features are needed to support the research of JLESC participants? How can we best collaborate with the ecosystem of testbeds growing out of the original Grid5000 testbed? What new research directions should we support?
The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. With the increasing amount of data processed by today's applications, however, these strategies suffer from unreasonable data-transfer demands or from the global synchronizations they entail.
The current trend towards task-based programming is an opportunity to revisit the principles of the checkpoint/restart strategies. We propose a checkpointing scheme which is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus allowing for very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can lead to a reasonable additional network load, measured to be lower than +10% on a dense linear algebra example.
This work is a use case of ULFM. We are open to collaborations to improve our work, and also to providing StarPU support for application developers who want to try this runtime. TAPIOCA seems to be an interesting way to implement a multi-level checkpointing policy, which is lacking in our current work.
We developed a parallel library to accelerate programs where multiple parallel workers update a shared resource, and sometimes concurrently try to update the same location. If the conflict rate is small, our library outperforms many alternatives. This can be used for graph algorithms, sparse matrix operations, back-propagation, etc.
While our initial idea appears promising and was published already, the library can be extended with new acceleration strategies, implementations for other hardware (we currently only deal with multi-core CPUs, not GPUs etc), and we are looking for more applications that could benefit from our work.
Neural network (NN) models are increasingly utilized by scientific simulations to extract knowledge from datasets. Tailoring NNs to specific scientific datasets is time-consuming and error-prone. Neural architecture search (NAS) automates the design of NN architectures. NAS attempts to find well-performing NN models for specialized datasets, where performance is measured by key metrics that capture NN capabilities. However, existing NAS methods are resource intensive, especially when searching for highly accurate models for large datasets. To address this problem, we present PEng4NN, a performance estimation engine that predicts neural network performance early in training. PEng4NN plugs into existing neural architecture search (NAS) methods; it predicts the final accuracy of the NNs that the NAS selects from the search space and reports these predictions to the NAS, enabling early training termination of the NNs in the NAS. PEng4NN has exhibited on average savings of 60% - 80% of training epochs needed in the NAS, which enables a throughput gain of 2.5 to 5 times. This enables the NAS to use fewer resources, explore more architectures, or explore a larger search space.
A. Keller Rorabaugh, S. Caino-Lores, M. R. Wyatt II, T. Johnston, M. Taufer 'PEng4NN: An Accurate Performance Estimation Engine for Efficient Automated Neural Network Architecture Search' https://arxiv.org/abs/2101.04185, 2021.
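The following sketch illustrates the general idea of performance estimation from a partial learning curve: fit a saturating curve to the first few epochs of accuracy and report its plateau as the predicted final accuracy. The functional form and the numbers are our own assumptions for illustration; they are not PEng4NN's actual predictive model.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(epoch, plateau, scale, rate):
    """Saturating-exponential model: accuracy approaches `plateau` as epochs grow."""
    return plateau - scale * np.exp(-rate * epoch)

# Pretend we have observed only the first 10 epochs of a candidate network.
rng = np.random.default_rng(0)
epochs = np.arange(1, 11, dtype=float)
observed = 0.92 - 0.50 * np.exp(-0.25 * epochs) + 0.005 * rng.standard_normal(10)

params, _ = curve_fit(learning_curve, epochs, observed, p0=(0.9, 0.5, 0.1), maxfev=10000)
print(f"predicted final accuracy ~ {params[0]:.3f}")
# A NAS can stop training this candidate now and spend the saved epochs
# exploring other architectures if the predicted plateau is unpromising.
```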
We seek collaborations with researchers in need of NNs tailored to their scientific datasets, where a NAS augmented by PEng4NN can be utilized.
We want to test PEng4NN on a broad range of NAS applications. To that end, we seek collaboration with any NAS developers interested in increasing the efficiency of their NAS.
Currently, PEng4NN develops a predictive model for accuracy. In the future, we plan to extend PEng4NN to predict loss as well and would be pleased to collaborate with researchers who are interested in modeling loss curves.
A protein’s structure determines its function. Different proteins have different structures; proteins in the same family share similar substructures and thus may share similar functions. Additionally, one protein may exhibit several structural states, also named conformations. Identifying different proteins and their conformations can help solve problems such as determining the cause of diseases and designing drugs. X-ray Free Electron Laser (XFEL) beams are used to create diffraction patterns (images) that can reveal protein structure and function. The translation from diffraction patterns in the XFEL images to protein structures and functionalities is nontrivial.
We present our XPSI (XFEL-based Protein Structure Identifier) that relies on ML methods (autoencoder and kNN) to capture key information that allows the identification of properties, such as spatial orientation and protein conformation from the diffraction patterns. In our previous talk, we explored data with two resolutions for one protein with two conformations. For this edition, we show the expansion of our framework for more complex protein diffraction imaging datasets: three imaging resolutions, two proteins with four conformations each, and the addition of a 3rd rotational angle for spatial orientation. We quantify the classification accuracy and performance of XPSI, obtaining an orientation prediction error up to 10 degrees and two-conformation accuracy prediction of 95%.
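As a rough skeleton of the "compress, then classify with kNN" pipeline, the snippet below uses PCA as a linear stand-in for the autoencoder and synthetic arrays in place of diffraction patterns; both substitutions are ours and are only meant to show the shape of the workflow, not XPSI itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for flattened diffraction patterns and conformation labels.
rng = np.random.default_rng(0)
patterns = rng.standard_normal((2000, 64 * 64)).astype(np.float32)
conformation = rng.integers(0, 4, size=2000)            # four conformations

Xtr, Xte, ytr, yte = train_test_split(patterns, conformation, random_state=0)

# PCA plays the role of the autoencoder: map each pattern to a small latent vector.
encoder = PCA(n_components=32).fit(Xtr)
knn = KNeighborsClassifier(n_neighbors=5).fit(encoder.transform(Xtr), ytr)

# With real diffraction data and a trained autoencoder, this score is the
# conformation-prediction accuracy (random data gives ~25% for four classes).
print("conformation accuracy:", knn.score(encoder.transform(Xte), yte))
```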
The addition of imaging factors in our experimental setup, such as resolution, rotation, and symmetric proteins, increases the complexity of our data and prediction task. Therefore, employing new techniques such as VAEs (variational autoencoders) or end-to-end deep learning frameworks is a promising next step for our framework.
This project is collaborative research between RIKEN, GCLab, and ICL.
What is the suitability of the current framework for real-world cases (e.g., 3-D reconstruction of proteins in general and the ribosome in particular, annotation of images for their classification without human intervention)?
What is the range of errors (i.e., distribution of orientation and/or conformation) tolerable for scientists?
What are the costs in terms of computational requirements (execution time and resources) for other methods that identify proteins’ properties used by the scientists and how does our framework compare?
Can we use systematic data from simulations to train the framework and use the trained models for classification/prediction of real experiment data successfully?
How can we augment the framework to also classify / predict protein structures (currently the framework captures orientations and conformations)?
What aspects from the science, such as, rotational orientation, beam resolution, symmetrical proteins, can presumably limit the functionality of the framework?
We present a framework to natively execute CUDA code on CPUs. Porting existing GPU codes to Fugaku is tedious, especially when it comes to DL frameworks (e.g., TensorFlow and PyTorch). Hence, we explore an alternative approach: we present the application with a virtual GPU and re-implementations of CUDA libraries for the CPU, so that CUDA applications can execute natively. We demonstrate a proof of concept of our idea and show early results of running ResNet-50 with PyTorch.
Replicating the CUDA runtime and CUDA libraries is doable; however, handwritten CUDA kernels pose a problem for the MocCUDA approach. We seek collaboration with CUDA experts and other interested parties to design automatic CUDA2CPU translation tools.
The new Braid project at ANL seeks to develop new methods for processing data from next-generation science experiments efficiently and reliably in a wide range of computing environments. When coupling machine learning with experimental data, tracking progress and validating queries becomes very difficult. We are developing a new provenance structure to capture the complex, versioned, and recursive structure of the input data that affects machine learning -based predictions. We will present the data model and architecture of this system in the context of Braid and describe four application engagements.
We have developed a prototype data model and implementation for Braid DB but invite discussion about other use cases, approaches, and experiences.
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging. In this presentation we will introduce E2Clab, a framework that implements a rigorous methodology that supports reproducibility and allows one to deploy real-life (e.g., Pl@ntNet botanical system) application workflows in the context of the Edge-to-Cloud Continuum on large-scale testbeds (e.g., Grid’5000). We will also highlight research challenges on optimizing the end-to-end deployment (e.g., minimizing processing latency, energy consumption, financial costs, etc.) of application workflows on distributed and heterogeneous resources in the Computing Continuum. The collaboration opportunities are in combining E2Clab + DeepHyper (ANL) to optimize the end-to-end application deployment.
Real-world applications deployed on the Computing Continuum (e.g., smart factory, autonomous vehicles, among others) typically need to comply with many constraints related to resource consumption (e.g., GPU, CPU, memory, storage and bandwidth capacities), software components composing the application and requirements such as QoS, security, and privacy. Furthermore, optimizing the end-to-end mapping and deployment (e.g., minimizing processing latency, energy consumption, financial costs, etc.) of application workflows on such distributed and heterogeneous resources is challenging. The parameter settings of the applications and the underlying infrastructure result in a myriad of configuration possibilities and, consequently, in a complex multi-infrastructure configuration search space. The intricacies of these configurations require an analysis in a controlled testbed environment in order to understand their performance trade-offs (i.e., latency and energy consumption, throughput and resource usage, cost and service availability, etc.), prior to the widespread deployment of such applications in production environments. The proposed approach aims to answer questions like: How to configure the system components distributed on Edge, Fog, and Cloud infrastructures to minimize the processing latency? Where application parts should be executed to minimize communication costs and end-to-end latency?
Collaboration opportunities: The idea is to apply DeepHyper (ANL) in a new domain (e.g., application workflows in the Computing Continuum). DeepHyper may be used to search the hyperparameters of models generated by E2Clab and then search the application configurations (explore the search space).
In this talk it will be shown how JSC uses numerical methods combined with machine learning (ML) approaches to improve diagnoses of pathologies in the human respiratory system. First, a data pipeline is presented that uses computed tomography recordings of the human upper airway as input and automatically provides the results of numerical flow simulations as output. The results are analyzed to estimate respiration capabilities, e.g., to evaluate the impact of narrowed channels on the supply of air to the lungs, or how efficiently the anatomy can heat up the air. The segmentation of air from other matter to define the region of interest, and the detection of inflow and outflow boundaries required to set up a simulation, make use of convolutional neural networks (CNNs). The automation of the pipeline now allows a large number of simulations to be conducted. The corresponding results will be used for data-driven techniques that accelerate and improve diagnoses. That is, CNNs will be used to intelligently initialize the flow field, yielding simulation results faster and stabilizing the simulation. Furthermore, CNNs will be used to localize and classify pathologies, and a reinforcement learning (RL) algorithm will be trained to modify the surface of the upper airway in a way that optimizes fluid-mechanical properties. The algorithm will then be capable of suggesting steps for surgical interventions.
Open questions and collaboration opportunities refer to three topics. First, it would be helpful to have a discussion with partners that have experience in initializing flow fields with the help of ML techniques. Second, a discussion is sought about the number of CT recordings needed for localizing and classifying pathologies. Third, it shall be discussed whether the RL works better with volumetric data or surface data (vertices and edges) as input to the neural networks that estimate the environment. A general exchange with experts on the usage of ML in the field of CFD is welcome.
In this work, we propose a novel memory-driven high performance DNN training framework that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger neural networks. Different from the state-of-the-art solutions that adopt image-based lossy compressors such as JPEG to compress the activation data, our framework purposely designs error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we provide theoretical analysis on the compression error propagation from the altered activation data to the gradients, and then empirically investigate the impact of altered gradients over the entire training process. Based on these analyses, we then propose an improved lossy compressor and an adaptive scheme to dynamically configure the lossy compression error-bound and adjust the training batch size to further utilize the saved memory space for additional speedup. We evaluate our design against state-of-the-art solutions with four widely-adopted DNNs and the ImageNet dataset.
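To give a feel for the error-controlling mechanism, here is a minimal NumPy sketch of error-bounded uniform quantization of activations: the codes kept between the forward and backward pass reconstruct the activations within a user-chosen error bound. It is a simplification for illustration only; the framework in the talk uses a purpose-built compressor plus adaptive error bounds and batch-size tuning.

```python
import numpy as np

def compress_activations(act, eb):
    """Error-bounded quantization: |act - decompress(codes)| <= eb elementwise."""
    # int16 codes suffice here because |act| / (2*eb) stays well below 2**15.
    return np.rint(act / (2.0 * eb)).astype(np.int16)

def decompress_activations(codes, eb):
    return codes.astype(np.float32) * (2.0 * eb)

rng = np.random.default_rng(0)
act = rng.standard_normal((256, 1024)).astype(np.float32)   # a forward-pass activation
eb = 1e-2

codes = compress_activations(act, eb)       # kept in memory instead of `act`
rec = decompress_activations(codes, eb)     # reconstructed for the backward pass

print("max error:", float(np.abs(act - rec).max()), "(bound:", eb, ")")
print("memory ratio:", codes.nbytes / act.nbytes)  # 0.5 here; a real compressor also
                                                   # entropy-codes the quantization codes
```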
Looking for more collaboration opportunities for memory lossy compression in deep learning and HPC applications
In this work, we will present our proposed optimization techniques for the end-to-end performance of the error-bounded lossy compressor cuSZ on GPU-based HPC systems.
Looking for real-world large-scale HPC applications that require GPU-based lossy compression to reduce the I/O and storage burden
Scientific computation increasingly consists of a workflow of interrelated tasks. Containerization can make workflow systems more manageable, reproducible, and portable, but containers can impede communication due to their focus on encapsulation. In some circumstances, shared-memory regions are an effective way to improve the performance of workflows; however, sharing memory between containerized workflow tasks is difficult. Recently, we have created a software library called Dhmem that manages shared memory between workflow tasks in separate containers, with minimal code change and performance overhead. Instead of all code being in the same container, Dhmem allows a separate container for each workflow task to be constructed completely independently. Dhmem enables additional functionality: easy integration in existing workflow systems, communication configuration at runtime based on the environment, and scalable performance. In this talk, we present our Dhmem library and showcase some areas where it has been shown to improve performance, as well as some future opportunities enabled by the use of Dhmem. We hope that this talk will set up potential collaborations in the future.
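Dhmem's own API is not shown here; as background, the sketch below uses Python's standard POSIX shared-memory support to illustrate the underlying mechanism a library like Dhmem can build on: a named region created by one task and mapped zero-copy by another. For containerized tasks, the containers would additionally need to share /dev/shm (e.g., a common IPC namespace or a bind-mounted volume), which is part of what such a library has to manage.

```python
import numpy as np
from multiprocessing import shared_memory

# Producer task: create a named region and publish an array into it.
shm = shared_memory.SharedMemory(name="workflow_field", create=True, size=8 * 1024 * 1024)
field = np.ndarray((1024, 1024), dtype=np.float64, buffer=shm.buf)
field[:] = 42.0

# Consumer task (normally a different process/container): attach by name, no copy.
peer = shared_memory.SharedMemory(name="workflow_field")
view = np.ndarray((1024, 1024), dtype=np.float64, buffer=peer.buf)
print(view[0, 0])          # reads the producer's data directly

peer.close()
shm.close()
shm.unlink()               # remove the region once all tasks are done with it
```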
Open questions include: how can Dhmem be used in situations where workflow tasks have very different lifetimes? How can Dhmem enable automatic and transparent access to data produced by the workflow? How can more workflow systems be supported by and integrated using Dhmem? Orcun Yildiz will be leading potential collaboration projects.
Earth system models (ESMs) have increased their spatial resolution to achieve more accurate solutions. As a consequence, the number of grid points increases dramatically, so an enormous amount of data is produced as simulation results. In addition, if ESMs manage to take advantage of the upcoming exascale computing power, their current data management systems will become a bottleneck as data production grows exponentially.
The XML Input/Output Server (XIOS) is an MPI parallel I/O server designed for ESMs to efficiently post-process data inline as well as read and write data in NetCDF4 format. Although it offers a good performance in terms of computational efficiency for current resolutions, this could change for larger resolutions since the XIOS performance is very dependent on the output size. To address this problem we test the HDF5 compression in order to reduce the size of the data so that both I/O time and storage footprint can be improved. However, the default lossless compression filter of HDF5 does not provide a good trade-off between size reduction and computational cost.
Alternatively, we consider using lossy compression filters that may allow reaching high compression ratios and enough compression speed to considerably reduce the I/O time while keeping high accuracy. In particular, we are exploring the feasibility of using the SZ lossy compressor developed by the Argonne National Laboratory (ANL) to write highly compressed NetCDF files through XIOS. As a case study, the Open Integrated Forecast System (OpenIFS) is used, a weather forecast model that can use XIOS to output data.
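Because NetCDF4 files are HDF5 files underneath, the compression trade-offs can be prototyped directly with h5py before touching XIOS. The sketch below writes a chunked variable with the standard lossless route (shuffle + gzip); a lossy filter such as SZ would be enabled the same way through its registered HDF5 filter ID once the plugin is available on HDF5_PLUGIN_PATH (the exact ID and filter parameters depend on the plugin build, so they are not shown here).

```python
import h5py
import numpy as np

# Toy stand-in for a model output field: 24 time steps on a 1-degree grid.
field = np.random.default_rng(0).standard_normal((24, 180, 360)).astype(np.float32)

with h5py.File("compression_test.h5", "w") as f:
    # Lossless baseline: one chunk per time step, byte-shuffle, then gzip.
    f.create_dataset("tas_gzip", data=field, chunks=(1, 180, 360),
                     shuffle=True, compression="gzip", compression_opts=4)
    # Uncompressed reference for comparing file sizes and write times.
    f.create_dataset("tas_raw", data=field, chunks=(1, 180, 360))
```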
Continuing the collaboration between ANL and BSC started the last JLESC workshop.
SZ, an error-bounded lossy compressor for scientific datasets, has seen five years of development and has accumulated many success stories. SZ has been used in many different use cases, including reducing memory footprint, reducing storage footprint, accelerating computation, and improving I/O performance. In this talk, I will summarize the recent success stories about SZ from the past two years, which involve multiple scientific applications across different scientific domains, such as seismic wave simulation (reverse time migration), quantum circuit simulation, and quantum chemistry simulation. I will also present updates on SZ's latest progress and project potential collaboration opportunities.
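For readers unfamiliar with how an error-bounded compressor achieves its guarantee, the sketch below shows the core prediction-plus-quantization idea in one dimension: quantize the prediction error, then reconstruct exactly as the decompressor would, so the pointwise error never exceeds the bound. This is a deliberately stripped-down illustration, not SZ's implementation, which adds higher-order and multi-dimensional predictors plus a fast lossless back end.

```python
import numpy as np

def predictive_compress(x, eb):
    """1-D previous-value predictor + linear-scaling quantization (error bound eb)."""
    codes = np.empty(len(x), dtype=np.int32)
    recon = np.empty_like(x)
    prev = 0.0
    for i, v in enumerate(x):
        code = int(np.rint((v - prev) / (2.0 * eb)))  # quantize the prediction error
        recon[i] = prev + code * 2.0 * eb             # what the decompressor will see
        codes[i] = code
        prev = recon[i]                               # predict from reconstructed data
    return codes, recon

signal = np.cumsum(np.random.default_rng(0).standard_normal(10_000) * 0.01)  # smooth data
codes, recon = predictive_compress(signal, eb=1e-3)

assert np.abs(signal - recon).max() <= 1e-3      # the error bound holds pointwise
# On smooth data the codes cluster around zero, which is why a lossless
# entropy-coding stage afterwards yields high compression ratios.
```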
Open questions: How should lossy compressors be used in applications (what specific requests do users have)? How can we assess the impact of lossy compressors on post hoc analysis?
Collaboration opportunities: the SZ team will be more than happy to work together on using or improving lossy compression for specific applications or various use cases.
Large-scale applications and workflows in scientific domains such as weather forecasting or cosmology have increasing I/O needs. However, storage has been designed for decades as a global shared resource, making I/O one of the main bottlenecks of HPC systems. To mitigate this congestion, new tiers (node-local SSDs, burst buffers, network-attached storage, and so on) have been added to recently deployed supercomputers, increasing their complexity. Harnessing this additional storage capacity is an active research topic, but little has been done on how to provision it efficiently.
Nevertheless, while for years high-performance computing (HPC) systems were the predominant means of meeting the requirements expressed by large-scale scientific workflows, today some components have moved away from supercomputers to Cloud-type infrastructures. From an I/O and storage perspective, the world of Cloud computing is very different from on-premise supercomputers: direct access to resources is extremely limited due to a very high level of abstraction.
Dealing with this high degree of heterogeneity, distributed between two worlds with very different philosophies, is a real challenge for scientific workflows and applications. In this project, we propose to explore ways to enable workflows to seamlessly use elastic storage systems on hybrid infrastructures combining HPC systems and Clouds. In particular, we want to focus on storage resource provisioning and on the questions it raises in terms of abstraction models and scheduling algorithms. With this talk, we provide an overview of our current research and potential collaborations within JLESC.
We are open to collaborations on different topics, from scheduling techniques adapted to storage resources to hybrid HPC/Cloud storage abstractions. We now have an active project on Chameleon for that purpose, which will allow us to better understand Cloud storage and run experiments.
The Performance API (PAPI) enables programmers to monitor the performance of their applications via inspection of native hardware events. The instrumentation of a user’s code with PAPI requires the usage of event names. However, for meaningful performance monitoring, the user must know what event is indexed by a given event name.
The Counter Analysis Toolkit (CAT) contains a suite of micro-benchmarks, which systematically measure the occurrences of events by stressing different types of performance hardware. Using the event occurrence-pattern data, we aim to identify events pertaining to particular programming concepts of interest. In this talk, we examine the role of the discrete least squares regression in solving this problem.
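A toy version of the regression step might look like the following: measure an event while a micro-benchmark sweeps a parameter, fit the counts against the expected occurrence pattern by discrete least squares, and use the slope and residual to decide whether the event name really indexes the concept of interest. The numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical CAT-style measurements: event counts while a micro-benchmark
# streams through arrays of n elements (one run per array size).
n = np.array([1_000, 2_000, 4_000, 8_000, 16_000, 32_000], dtype=float)
counts = np.array([1_010, 2_030, 4_010, 8_050, 16_020, 32_060], dtype=float)

# Discrete least squares fit of counts ~= slope * n + intercept.
design = np.column_stack([n, np.ones_like(n)])
(slope, intercept), *_ = np.linalg.lstsq(design, counts, rcond=None)
residual = counts - design @ np.array([slope, intercept])

print(f"slope = {slope:.3f} counts/element, intercept = {intercept:.1f}")
print(f"relative residual = {np.linalg.norm(residual) / np.linalg.norm(counts):.2e}")
# A slope near 1 with a small residual supports the hypothesis that this event
# counts one occurrence per element (e.g., one load per array element).
```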
The open questions for this talk are as follows:
- How can this data analysis be made more robust?
- What are other data analysis techniques which can be applied post-regression?
The EuroCC project, with over 30 partners, is funded by the EuroHPC Joint Undertaking as part of the EU framework programme for research and innovation, Horizon 2020. It aims at bringing the participating countries to a common high level in all HPC-related areas. To this end, National Competence Centres are established that map the competences of each partner country in HPC, HPDA, and AI. Besides filling identified gaps, the vision of the EuroCC project is to make HPC and its related fields more accessible not only to users from science, but especially to industry. This will be achieved by offering a portfolio of services tailored to the needs of the users, such as training and consulting.
The Jülich Supercomputing Centre (JSC) contributes to EuroCC as a member of the German National Competence Centre. More precisely, JSC is extending its Industry Relations Team that aims at establishing new industrial collaborations based on both commercial and third-party funded relations.
The current talk will give a brief overview first on the EuroCC project in general, and subsequently on the activities of JSC's Industry Relations Team.
General thoughts on the EuroCC project, discussion on experiences on industry cooperations (offered services, what defines success from the industry perspective and the HPC center perspective, how to identify partners that could benefit from HPC, especially SMEs).
We are currently developing tools to measure the amount and location of active memory pages. The analysis results shall help with decisions on whether moving certain hot allocations to HBM would be beneficial. We use the mainline Linux kernel feature CONFIG_IDLE_PAGE_TRACKING (a non-default option).
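For context, the kernel interface behind CONFIG_IDLE_PAGE_TRACKING can be exercised from user space roughly as follows: translate a virtual address to a page frame number via /proc/<pid>/pagemap, mark the frame idle in /sys/kernel/mm/page_idle/bitmap, and re-read the bit later to see whether the page was accessed in between. This is a minimal sketch of the mechanism (root privileges required), not the group's actual measurement tool.

```python
import os
import struct
import numpy as np

PAGE = os.sysconf("SC_PAGE_SIZE")

def vaddr_to_pfn(pid, vaddr):
    # /proc/<pid>/pagemap: one 64-bit entry per virtual page;
    # bit 63 = present, bits 0-54 = page frame number (root required to see PFNs).
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // PAGE) * 8)
        entry = struct.unpack("<Q", f.read(8))[0]
    return entry & ((1 << 55) - 1) if (entry >> 63) & 1 else None

def mark_idle(pfn):
    # /sys/kernel/mm/page_idle/bitmap: one bit per PFN, 64 PFNs per 8-byte word.
    with open("/sys/kernel/mm/page_idle/bitmap", "r+b", buffering=0) as f:
        f.seek((pfn // 64) * 8)
        f.write(struct.pack("<Q", 1 << (pfn % 64)))

def still_idle(pfn):
    with open("/sys/kernel/mm/page_idle/bitmap", "rb", buffering=0) as f:
        f.seek((pfn // 64) * 8)
        return bool((struct.unpack("<Q", f.read(8))[0] >> (pfn % 64)) & 1)

buf = np.ones(PAGE, dtype=np.uint8)              # touch one page so it is resident
pfn = vaddr_to_pfn(os.getpid(), buf.ctypes.data)
mark_idle(pfn)
buf[0] += 1                                      # access the page again
print("page accessed since being marked idle:", not still_idle(pfn))
```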
The sampling rate is currently limited by the number of kernel calls needed to acquire information about all mapped pages of an application. Are there ways to improve the measurement resolution?
Can other groups use this for deriving interesting properties of their applications, too?
Matching these measurements with other performance counters, execution traces, ... by integration with existing tools would be a collaboration opportunity.
The generalized minimal residual method (GMRES) is a commonly used solver for sparse, non-symmetric systems of linear equations. As a Krylov method, its runtime is dominated by data movement. We therefore look at a selective reduction in precision designed to improve performance by reducing data movement while still achieving a full-precision solution.
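As a small, generic illustration of why reduced precision can pay off while still reaching a double-precision answer, the sketch below runs the inner GMRES solves in single precision and applies the corrections against a double-precision residual (iterative refinement). This is a standard mixed-precision pattern, not the selective in-solver precision reduction discussed in the talk.

```python
import numpy as np
from scipy.sparse import eye, random as sparse_random
from scipy.sparse.linalg import gmres

# A well-conditioned, non-symmetric sparse test system.
n = 2000
A = (sparse_random(n, n, density=1e-3, random_state=0) + 10 * eye(n)).tocsr()
b = np.ones(n)

A32 = A.astype(np.float32)          # single-precision copy: less memory traffic
x = np.zeros(n)                     # double-precision iterate

for it in range(5):
    r = b - A @ x                                   # residual in double precision
    d, info = gmres(A32, r.astype(np.float32),      # cheap single-precision inner solve
                    restart=30, maxiter=200)
    x += d.astype(np.float64)                       # correction applied in double
    print(it, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```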
Open questions include:
* Can this approach be transferred to distributed systems, other GMRES variants and other Krylov solvers?
* Are there other data-reduction techniques that can be used effectively?
Additionally, there may be collaboration opportunities in finding applications that benefit from mixed-precision GMRES.
Low-rank matrix approximation plays an important role in data analysis and scientific computing. With the increase of data volumes, scalability becomes increasingly important, pushing for a rethinking of classical approaches to data handling. In this brief talk, we discuss some recently proposed randomized algorithms that aim to provide scalable solutions to low-rank matrix approximation problems while maintaining reasonable numerical performance. We aim to show that randomization does not always imply uncertainty, but rather trades a small amount of accuracy for a significant performance increase.
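As one concrete example of the class of methods in question, here is a compact NumPy sketch of the randomized range-finder SVD in the spirit of Halko, Martinsson and Tropp. It illustrates the general template (sketch the range, project, solve a small dense problem) rather than any specific algorithm proposed by the speakers.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power_iter=2, seed=0):
    """Rank-k SVD approximation via a Gaussian sketch of the range of A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Y = A @ rng.standard_normal((n, k + oversample))   # sample the range of A
    for _ in range(n_power_iter):                      # power iterations sharpen the
        Y = A @ (A.T @ Y)                              # estimate for flat spectra
    Q, _ = np.linalg.qr(Y)                             # orthonormal range basis
    B = Q.T @ A                                        # small (k+p) x n problem
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Quick sanity check on data with a decaying spectrum.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 400)) * (0.9 ** np.arange(400))   # column-wise decay
U, s, Vt = randomized_svd(A, k=20)
print("relative error:", np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))
```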
Produce a formal set of notes about modern practical randomized algorithms. Provide templates for each algorithm, for a general user to be aware of the underlying principles of methods (Similar to 'Barrett, R., M.W. Berry, T.F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romaine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994.').
Produce a high-quality randomized algorithms library to be proposed to application domain experts and to serve as a 'state of the art' reference.
This work involves a verification method without directed rounding that has high portability for various computers. In particular, this talk focuses on eigenvalue problems. I will explain the verification method and illustrate numerical examples in terms of speed and accuracy.
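For orientation, the snippet below shows the simplest residual-based enclosure one can state for a symmetric matrix: some exact eigenvalue lies within the residual norm of any approximate eigenpair. It only illustrates the underlying mathematical bound; a true verification method must additionally bound the floating-point rounding errors in evaluating that residual, which is exactly where the directed-rounding-free techniques of this talk come in.

```python
import numpy as np

# Residual bound for a symmetric matrix A: for an approximate pair (lam, v),
# some exact eigenvalue lies in [lam - r, lam + r] with r = ||A v - lam v|| / ||v||.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
A = (A + A.T) / 2                          # make it symmetric

eigvals, eigvecs = np.linalg.eigh(A)       # approximate eigenpairs
lam, v = eigvals[0], eigvecs[:, 0]
radius = np.linalg.norm(A @ v - lam * v) / np.linalg.norm(v)

# Caveat: the residual itself is computed in floating point here, so this is
# not yet a rigorous enclosure in the verified-computing sense.
print(f"eigenvalue {lam:.6f} with residual radius {radius:.2e}")
```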
In many cases, verification methods are tested using random matrices. Therefore, I want to collaborate with real applications that require accuracy or upper error bounds, and test the feasibility of verification to the applications.
Coupling asynchronous programming models (APMs) with MPI can be tedious and hard to do correctly. Even though MPI provides support for multiple threads and non-blocking communication operations, applications are still required to track active communications and map them to outstanding asynchronous activities in their respective programming model, polling MPI for completion in the meantime. This talk provides an overview of MPI Continuations, which provide a callback-based notification mechanism that can be used to simplify the coupling of APMs with MPI. By replacing the polling with a callback scheme it is possible to couple MPI with different APMs such as OpenMP, Argobots, and OmpSs, avoiding any tight integration on either side.
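To show what the callback mechanism removes, here is the polling pattern that an asynchronous runtime typically has to implement on top of today's MPI (written with mpi4py and run with two ranks). The continuations interface itself (the proposed MPIX_* calls) is deliberately not sketched here; with it, the callback would be attached to the request and invoked on completion, and the polling loop below would disappear.

```python
# Run with: mpiexec -n 2 python polling_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
pending = []                                   # (request, callback) pairs tracked by hand

if comm.Get_rank() == 0:
    req = comm.isend({"step": 1}, dest=1, tag=7)
    pending.append((req, lambda _: print("send completed, task can be released")))
else:
    req = comm.irecv(source=0, tag=7)
    pending.append((req, lambda msg: print("received", msg)))

# The runtime keeps polling outstanding requests and dispatches follow-up work itself.
while pending:
    still_pending = []
    for req, callback in pending:
        done, msg = req.test()                 # MPI_Test under the hood
        if done:
            callback(msg)
        else:
            still_pending.append((req, callback))
    pending = still_pending
```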
Looking for target applications using (task-based) asynchronous programming models and feedback on the current state of the design of the MPI continuations interface.
In this work, we consider the ExaGeoStat distributed application, which computes the maximum likelihood estimation (MLE) for given spatial data and provides an efficient way to predict missing observations in the context of climate/weather forecasting applications. This application repeatedly computes a covariance matrix using the Matérn kernel, followed by a Cholesky factorization. Thanks to the underlying task-based runtime (StarPU/Chameleon), both operations can be partially overlapped, but although the latter operation can efficiently exploit GPUs, the former is restricted to CPUs, which raises interesting load-balancing and scheduling challenges. Improving performance by adding more fat nodes comprising GPUs is both costly and inefficient. We show that the best performance is obtained by considering a heterogeneous set of nodes and by using non-trivial, distinct heterogeneous data distributions for the two phases along with well-adapted scheduling priorities.
Several parameters of our distribution and of the scheduling have been tailored for this application. To evaluate the generality of our approach, we are looking for other applications that may benefit from the techniques we propose.
In such iterative applications, how much can be learned automatically (in terms of workload and scheduling decisions) from past iterations and transferred to the next?
This presentation centers on how two pragma-based programming models, made to function completely independently and unaware of each other, can be made to collaborate effectively with minimal additional programming intervention. This work entails the detection and modification of OpenACC pragmas by the OmpSs-2 compiler and the automatic management of OpenACC asynchronous device queues by the OmpSs-2 runtime. We will go over the separation of duties between the models and an in-depth description of the mechanisms needed for interoperation. We will provide concrete code examples using ZPIC, a plasma simulator application written in OmpSs-2, OpenACC, and OmpSs-2 + OpenACC. We will compare the performance and programmability benefits of the OmpSs-2 + OpenACC ZPIC implementation against the other single-model implementations.
What abstractions are needed for combining different programming models? If such abstractions can be well defined, can they be used to combine any two or more models?
Collaboration through porting applications.
We provide a brief overview of our task-parallel programming library Eventify on GPUs. Eventify on GPUs offers diverse task queueing approaches that vary in memory layout, access scope, and synchronization mechanism. In this short talk, we present a selection of those queueing approaches and their influence on application runtime. First, we take a quick look at an initial task-pool approach with a single, global memory queue. Second, we consider a work-sharing approach with multiple task queues. Finally, we present an extension of the work-sharing approach with multiple, hierarchical task queues.
Collaboration opportunities include lock-free queueing approaches as well as message passing for multi-GPU and CPU-GPU tasking.
In this short talk we compare two different parallelization strategies for our Fast Multipole Method.
Using the traditional loop-based OpenMP approach on one hand and our own task-based approach Eventify on the other, we show their effectiveness and efficiency with respect to effort and scaling.
We discuss advantages and bottlenecks of both approaches and highlight how algorithm-specific knowledge helps to improve scalability.
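For readers unfamiliar with the distinction, the generic sketch below contrasts the two styles in plain OpenMP; it is not Eventify code, and compute_cell is a placeholder for an FMM operator on one cell.

```c
/* Generic illustration (not Eventify code) of the two parallelization styles
 * compared in the talk: a flat loop-based OpenMP version versus a task-based
 * version that follows the algorithm's structure. */
#include <omp.h>

#define NCELLS 1024

void compute_cell(double *cells, int i) { cells[i] *= 2.0; /* placeholder work */ }

void loop_based(double *cells)
{
    /* Loop parallelism: one homogeneous loop, scheduling left to OpenMP. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < NCELLS; ++i)
        compute_cell(cells, i);
}

void task_based(double *cells)
{
    /* Task parallelism: each unit of work becomes a task, allowing irregular,
     * dependency-driven execution, as a tasking library would exploit. */
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < NCELLS; ++i) {
        #pragma omp task firstprivate(i)
        compute_cell(cells, i);
    }
    /* all tasks complete at the implicit barrier closing the single region */
}
```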
Since our tasking approach is currently limited to shared memory, we would like to extend its applicability to distributed memory systems via message passing.
We present a method to manage the complexity of modern memory systems. It is composed of a portable, abstracted API that identifies memory kinds and describes their hardware characteristics through metrics such as bandwidth, latency, and capacity. It allows runtime systems, parallel libraries, and scientific applications to select the most appropriate memory to use.
In addition, we present a survey using static code analysis, profiling, and benchmarking to determine the sensitivity of application buffers. Combining these approaches with our API enables a portable and productive method to match application requirements with hardware memory characteristics.
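To make the idea tangible, here is a hypothetical sketch of what selecting a memory kind by such metrics could look like. The type and function names (mem_kind_t, mem_select, mem_alloc) are invented for illustration and are not the presented API.

```c
/* Hypothetical sketch (not the presented API) of selecting a memory kind by
 * hardware metrics such as bandwidth and capacity. */
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    const char *name;        /* e.g. "DDR", "MCDRAM", "HBM" */
    double bandwidth_gbs;    /* measured or advertised bandwidth */
    double latency_ns;       /* access latency */
    size_t capacity_bytes;   /* total capacity of this kind */
} mem_kind_t;

/* Pick the highest-bandwidth kind that still has enough capacity;
 * fall back to the largest kind otherwise. */
static const mem_kind_t *mem_select(const mem_kind_t *kinds, int n, size_t need)
{
    const mem_kind_t *best = NULL, *fallback = NULL;
    for (int i = 0; i < n; ++i) {
        if (!fallback || kinds[i].capacity_bytes > fallback->capacity_bytes)
            fallback = &kinds[i];
        if (kinds[i].capacity_bytes >= need &&
            (!best || kinds[i].bandwidth_gbs > best->bandwidth_gbs))
            best = &kinds[i];
    }
    return best ? best : fallback;
}

/* A real implementation would bind the allocation to the chosen NUMA node
 * (e.g. via hwloc or libnuma); plain malloc is a stand-in here. */
static void *mem_alloc(const mem_kind_t *kind, size_t bytes)
{
    (void)kind;
    return malloc(bytes);
}
```

A runtime could then call such a selection routine per buffer, using the sensitivity ranking obtained from the survey mentioned above.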
Open question: how should the target capacity be handled when it conflicts with affinities, for example when trying to allocate two 10 GB buffers on the 16 GB MCDRAM of a KNL node?
Collaboration opportunities:
It would be interesting to collaborate with people from Argonne working on AML, which is a memory management library.
Title | Presenter
---|---
Fugaku: the first 'Exascale' supercomputer | Prof. Satoshi Matsuoka, Riken CCS
Abstract: Fugaku is the first 'exascale' supercomputer of the world, not due to its peak double-precision flops, but rather due to its demonstrated performance in real applications that were expected of exascale machines at their conception 10 years ago, as well as reaching actual exaflops in a new breed of benchmarks such as HPL-AI. But the real importance of Fugaku lies in the "applications first" philosophy under which it was developed, and its resulting mission to be the centerpiece for the rapid realization of the so-called Japanese 'Society 5.0' as defined by the Japanese S&T national policy. As such, Fugaku's immense power is directly applicable not only to traditional scientific simulation applications, but it can also be a target of Society 5.0 applications that encompass the convergence of HPC, AI and Big Data as well as of the Cyber (IDC & Network) and Physical (IoT) spaces, with immediate societal impact as its technologies are utilized as Cloud resources. In fact, Fugaku is already in partial operation a year ahead of schedule, primarily to obtain early Society 5.0 results including combatting COVID-19 as well as resolving other important societal issues, and it will go into full production shortly.
How a Community Software Ecosystem Perspective Helps to Advance Science Goals in the Exascale Computing Project | Dr. Lois Curfman McInnes, Argonne National Laboratory
Abstract: Teams in the U.S. Exascale Computing Project (ECP) are working toward scientific advances on forthcoming exascale platforms, across a diverse suite of applications in chemistry, materials, energy, Earth and space science, data analytics, optimization, artificial intelligence, and national security. In turn, these applications build on software components, including programming models and runtimes, mathematical libraries, data and visualization packages, and development tools that comprise the Extreme-scale Scientific Software Stack (E4S). E4S represents a portfolio-driven effort to collect, test, and deliver the latest in reusable open-source HPC software products, as driven by the common needs of applications. E4S establishes product quality expectations and provides a portal as a starting point for access to product documentation. This presentation will discuss early experiences with how this software ecosystem approach delivers the latest advances from ECP software technology projects to applications, thereby helping to overcome software collaboration challenges across distributed aggregate teams. A key lesson learned is the need for close collaboration between teams developing applications and reusable software technologies, as well as the need for crosscutting strategies to increase developer productivity and software sustainability, thereby mitigating technical risks by building a firmer foundation for reproducible, sustainable science.
High-Performance Deep Learning | Prof. Torsten Hoefler, ETH Zurich
Abstract: Deep Learning is as computationally expensive as the most challenging scientific computing applications. In this talk, we outline the biggest challenges in training deep learning workloads and show how HPC techniques can be used to improve the performance of training workloads. We focus on model sparsity in the training process. This will be even more important once the scientific computing community uses deep learning in their workflows.
A Story About Data: Advancing Storage, I/O and Processing at Challenging Scales | Dr. Gabriel Antoniu, INRIA
Abstract: Looking back over more than 10 years of collaboration within JLESC involving Inria, the University of Illinois at Urbana-Champaign and Argonne National Laboratory, this talk will highlight a few achievements on hot topics related to data storage, I/O management and in situ visualisation and processing. From the initial challenges in these areas posed by the expected arrival of Exascale systems, new ones emerged as the frontiers started to blur between High-Performance Computing and Big Data analytics. We will also discuss upcoming open problems triggered by the increasingly complex workflows mixing simulations, analytics and AI, which emphasize new requirements and opportunities created by their potential execution on the HPC/Cloud/Edge computing continuum.
Resilience for Extreme Scale Computing | Dr. Leo Bautista, BSC
Abstract: Resilience has been one of the main research topics of the JLESC since its conception over a decade ago. We have covered multiple types of failures and errors, which has led to completely different fault tolerance techniques, some of them at the intersection of HPC and ML. The research work, carried out by JLESC researchers from five different institutions, shows a strong interaction between theoretical analysis and practical implementations. The results of this endeavor have led to multiple collaboration visits, dozens of publications and hundreds of citations; but more interestingly, it has opened new questions and revealed connections between HPC fields that we did not know were connected before. In this talk we will go over this trajectory and take a quick glance at what might come in the future for HPC resilience.
Developer tools for porting & tuning parallel applications on extreme-scale systems | Dr. Brian Wylie, JSC
Abstract: Application developers targeting extreme-scale HPC systems such as Fugaku, and modular supercomputing architectures such as JUWELS, need effective tools to assist with porting and tuning for these unusual systems. This collaborative project brings together developers of such tools from JLESC partners to investigate their integration and support joint training activities as the tools are deployed and applied to a variety of application codes. |
Satoshi Matsuoka became the director of Riken CCS in April 2018. Riken CCS is the top-tier HPC center representing HPC in Japan, developing and hosting Japan's tier-one 'Fugaku' supercomputer, which has become the fastest supercomputer in the world in all four major supercomputer rankings, alongside a multitude of ongoing cutting-edge HPC research, including investigations into Post-Moore era computing.
He had been a Full Professor at the Global Scientific Information and Computing Center (GSIC) of the Tokyo Institute of Technology since 2000 and the director of the joint AIST-Tokyo Tech Real World Big Data Computing Open Innovation Laboratory (RWBC-OIL) since 2017, and he became a Specially Appointed Professor at Tokyo Tech in 2018 alongside his directorship at R-CCS.
He has been the leader of the TSUBAME series of supercomputers that have won many accolades such as world #1 in power-efficient computing. He also leads various major supercomputing research projects in areas such as parallel algorithms and programming, resilience, green computing, and convergence of big data/AI with HPC.
He has written over 500 articles according to Google Scholar and chaired numerous ACM/IEEE conferences, including serving as Program Chair of the ACM/IEEE Supercomputing Conference (SC13) in 2013. He is a Fellow of the ACM and the European ISC and has won many awards, including the JSPS Prize from the Japan Society for the Promotion of Science in 2006, presented by His Highness Prince Akishino; the ACM Gordon Bell Prize in 2011; the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology in 2012; the 2014 IEEE-CS Sidney Fernbach Memorial Award, the highest prestige in the field of HPC; the HPDC 2018 Achievement Award from ACM; and, most recently, the SC Asia 2019 HPC Leadership Award.
Lois Curfman McInnes is a senior computational scientist in the Mathematics and Computer Science Division of Argonne National Laboratory. Her work focuses on high-performance computational science and engineering, with emphasis on scalable numerical libraries and community collaboration toward productive and sustainable software ecosystems. Lois has developed numerical algorithms and software for the parallel solution of large-scale scientific applications involving nonlinear partial differential equations and related optimization problems in the PETSc/TAO libraries. In February 2020, she began serving as Deputy Director of Software Technology in the US DOE Exascale Computing Project. Lois also co-leads the IDEAS project, whose members are partnering with the community to improve software productivity and sustainability as a key aspect of advancing overall scientific productivity. Previously, she served as the technical area lead of Mathematical Libraries in ECP’s Software Technology thrust and lead of the xSDK project—community collaboration to advance code quality, access, and interoperability while working toward a software ecosystem for high-performance numerical libraries. Lois is a SIAM Fellow. She won the 2015 SIAM/ACM Prize in Computational Science and Engineering and received an R&D 100 Award in 2009 (with collaborators) for work on the PETSc library; she also won an E.O. Lawrence Award in 2011 for outstanding contributions in research and development supporting DOE and its missions. She received a PhD in applied mathematics from the University of Virginia and a BS in mathematics and physics from Muhlenberg College. Lois served as Chair (2015–16) and Program Director (2013–14) of the SIAM Activity Group on Computational Science and Engineering, and she currently serves on the SIAM Council and the editorial board of SIAM News. More information: https://www.mcs.anl.gov/~mcinnes.
Torsten Hoefler directs the Scalable Parallel Computing Laboratory (SPCL) at D-INFK ETH Zurich. He received his PhD degree in 2007 at Indiana University and started his first professor appointment in 2011 at the University of Illinois at Urbana-Champaign.
Torsten has served as the lead for performance modeling and analysis in the US NSF Blue Waters project at NCSA/UIUC. Since 2013 he has been a professor of computer science at ETH Zurich and has held visiting positions at Argonne National Laboratory, Sandia National Laboratories, and Microsoft Research Redmond (Station Q).
Dr. Hoefler's research aims at understanding the performance of parallel computing systems ranging from parallel computer architecture through parallel programming to parallel algorithms. He is also active in the application areas of Weather and Climate simulations as well as Machine Learning with a focus on Distributed Deep Learning. In those areas, he has coordinated tens of funded projects and an ERC Starting Grant on Data-Centric Parallel Programming.
He has been chair of the Hot Interconnects conference and technical program chair of the Supercomputing and ACM PASC conferences. He is an associate editor of the IEEE Transactions on Parallel and Distributed Systems (TPDS) and the Parallel Computing Journal (PARCO) and a key member of the Message Passing Interface (MPI) Forum.
He has published more than 200 papers in peer-reviewed international conferences and journals and co-authored the latest versions of the MPI specification. He has received best paper awards at the ACM/IEEE Supercomputing Conference in 2010, 2013, and 2014 (SC10, SC13, SC14), EuroMPI 2013, IPDPS'15, ACM HPDC'15 and HPDC'16, ACM OOPSLA'16, and other conferences. Torsten received ETH Zurich's Latsis Prize in 2015, the SIAM SIAG/Supercomputing Junior Scientist Prize in 2012, the IEEE TCSC Young Achievers in Scalable Computing Award in 2013, the Young Alumni Award 2014 from Indiana University, and the best student award 2005 of the Chemnitz University of Technology. Torsten was elected into the first steering committee of ACM's SIGHPC in 2013 and he was re-elected in 2016.
Gabriel Antoniu is a Senior Research Scientist at Inria, Rennes, where he leads the KerData research team. His recent research interests include scalable storage, I/O and in situ visualization, data processing architectures favoring the convergence of HPC, Big Data analytics and AI. He has served as a PI for several international projects in these areas in partnership with Microsoft Research, IBM, ATOS/BULL, Argonne National Lab, the University of Illinois at Urbana Champaign, Universidad Politécnica de Madrid, Barcelona Supercomputing Center. He served as Program Chair for the IEEE Cluster conference in 2014 and 2017 and regularly serves as a PC member of major conferences in the area of HPC, cloud computing and Big Data analytics (SC, HPDC, CCGRID, Cluster, Big Data, etc.). He co-authored over 150 international publications in the aforementioned areas. More information: https://team.inria.fr/kerdata/gabriel-antoniu/.
Dr. Leonardo Bautista-Gomez is an expert in reliability for high-performance computing (HPC). He is a Senior Research Scientist at the Barcelona Supercomputing Center (BSC) where he works in multiple H2020 European projects related to resilience, energy efficiency and multilevel storage systems for HPC. He is currently involved in multiple European projects, leading a team of Ph.D. students and engineers on the development of resilience techniques for large scale energy-efficient heterogeneous systems and machine learning workflows.
In 2016 he was awarded a European Marie Skłodowska-Curie Actions fellowship on Deep-memory Ubiquity, Resilience and Optimization. In addition, he received the 2016 IEEE TCSC Award for Excellence in Scalable Computing (Early Career Researcher). Before moving to BSC he was a postdoctoral researcher at Argonne National Laboratory, where he investigated data corruption detection techniques and error propagation. Most of this research focused on how errors affect scientific applications and how to detect silent corruption using machine learning tools.
Leonardo did his PhD in resilience for supercomputers at the Tokyo Institute of Technology. He developed a scalable multilevel checkpointing library called the Fault Tolerance Interface (FTI) to guarantee application resilience at extreme scale. For this work, he received an Honorable Mention for the 2011 ACM/IEEE George Michael Memorial High Performance Computing PhD Fellowship at the Supercomputing Conference 2011 (SC11). Furthermore, he received a Special Certificate of Recognition for achieving a perfect score at SC11 for the paper "FTI: High-Performance Fault Tolerance Interface for Hybrid Systems". Moreover, he was awarded a Japan Society for the Promotion of Science (JSPS) Research Fellowship for Young Scientists (Doctoral Course).
Before moving to Tokyo Tech, he obtained a Master's degree in Distributed Systems and Applications from Pierre & Marie Curie University (Paris 6). Prior to this, he obtained a Bachelor's degree in Computer Science from the same university.
Brian J. N. Wylie has been a scientific researcher in Juelich Supercomputing Centre since 2004, in the group developing the Scalasca toolset for scalable performance analysis of large-scale parallel applications. He established and continues to contribute to tools training activities of the Virtual Institute -- High Productivity Supercomputing (VI-HPS). His current focus is the assessment of exascale readiness of applications comprising very large numbers of processes and threads within the Performance Optimisation and Productivity (POP) EU Centre of Excellence in HPC.
This work will present a preliminary design and results of a Vivado HLS implementation of lossy compression for Xilinx FPGAs, based on an updated SZ algorithm.
Looking for new use scenarios for FPGA/ASIC-based SZ lossy compression
This work discusses use cases of state preservation, which generalizes checkpointing into productive use cases (in addition to resilience). It focuses in particular on data management for deep learning applications, for which the state is either pre-processed training samples or partially trained models.
How can deep learning applications use model checkpoints and the dependencies between them to build better training strategies? How can caching of training data reduce sampling overhead? How can we add replay support to streams of training data to avoid catastrophic forgetting?
Non-deterministic communication is an important part of ensuring swift parallel execution of MPI applications, but can result in unpredictable, sometimes problematic, application behaviors. Specifically, non-determinism in executions can cause non-reproducible bugs and numerical non-reproducibility. Gathering adequate data to characterize non-deterministic behaviors in a given application is often costly at a minimum, and seemingly impossible at worst. While tools exist to assist in these efforts, there are limitations. For example, some tools focus entirely upon catastrophic bugs caused by non-determinism, while others, especially record-and-replay tools, enforce a particular execution pattern upon an application to guarantee reproducibility of a result. There is inadequate understanding of root sources of non-determinism for broader classes of bugs and scientific applications.
We present a software framework for identifying root sources of non-deterministic behavior in MPI applications and quantifying the application’s degree of non-determinism at different scales through the lens of graph similarity. Specifically, we devise a novel extension of event graph modeling in which we transform graph kernels to become a quantitative proxy for communication non-determinism of MPI applications. Our framework supports codes integrating four diverse MPI communication patterns: the simple but ubiquitous message race, the receiver-side and sender-side non-deterministic patterns in the Algebraic Multigrid 2013 Benchmark (AMG2013) [1], the non-blocking MPI 2-dimensional grid communication pattern (MCB Grid) in the Monte Carlo Benchmark [2], and the Unstructured Mesh pattern [3] with its randomized process topology. We show results for a message race benchmark at different scales. Our workflow has the potential to reduce costs of debugging and justify the soundness of MPI application results, even in the presence of numerical irreproducibility.
[1] J. Park, M. Smelyanskiy, U. M. Yang, D. Mudigere, and P. Dubey, 'High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems,' in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12.
[2] N. Gentile and B. Miller, 'Monte carlo benchmark (MCB),' https://computing.llnl.gov/projects/co-design/mcb, LLNL-CODE-507091.
[3] N. Jain and A. Bhatele, 'Chatterbug communication proxy applications suite,' https://github.com/LLNL/chatterbug, LLNL-CODE-756471.
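For readers unfamiliar with the message-race pattern mentioned above, the following textbook example (not taken from the cited benchmarks) shows how MPI_ANY_SOURCE makes the receive order, and hence the printed sequence, non-deterministic across runs.

```c
/* Minimal MPI message race: the order in which rank 0 receives the messages
 * is non-deterministic, so repeated runs can print different sequences.
 * Run with at least 3 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int i = 1; i < size; ++i) {
            int who;
            MPI_Recv(&who, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received from rank %d\n", who);   /* order may vary */
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```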
We are looking for collaborators who have MPI applications (with point-to-point communications) and non-deterministic behaviour to serve as testing use cases for our framework. We have used the framework for miniAMR and MCB runs.
We are interested in collaborating with colleagues who are using graph theory or other methods to identify aspects associated with non-determinism such as bugs, differences in results, or other reasons.
We are interested in the state of MPI noise injection moving into QMPI. In the past, various noise injection techniques have existed, such as NINJA and Jitterbug. We want to use noise injection to study large scale communication nondeterminism with noise-injected smaller scale tests as a proxy. Tools that are built with QMPI in mind are needed to support tool stacks in future MPI implementations.
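As a reminder of the classic mechanism behind such tools, the sketch below injects noise through PMPI interposition; a QMPI-based tool would register the same interception as a tool callback rather than relying on link-time symbol interposition. The delay length and distribution here are arbitrary placeholders.

```c
/* Sketch of PMPI-based noise injection: intercept MPI_Send, add a random
 * delay, then call the real routine. Build as a shared library and preload
 * it (or link it before the MPI library). */
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    usleep(rand() % 100);                 /* injected noise: 0-99 us */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}
```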
Microsoft, Google and other large cloud companies have been developing their own custom chips in-house. This move toward custom hardware is driven by the end of Moore's law. That said, developing new custom hardware still requires serious effort. The good news is that the advent of the open-source hardware movement and an open-source-friendly semiconductor industry seems to be gradually changing the perception of hardware development and may eventually create a true co-design playground where software and hardware experts work together. In the meantime, I have had the opportunity to be involved in Argonne's next-generation X-ray detector ASIC project. I will give a brief summary of the custom hardware movement, our recently started X-ray detector ASIC project and compressor designs, as well as a new hardware design methodology using Chisel, a Scala-based hardware construction language.
One question is how hardware customization can help our work. Another is how we can leverage the open-source hardware movement and ecosystem. FPGAs allow us to create custom hardware in exchange for the overhead of reconfigurability, and they have demonstrated their performance on certain applications. Application-specific integrated circuits (ASICs) offer significantly better performance than FPGAs for similar logic, but the development cost of a custom ASIC may still be prohibitive. What will happen if the cost of ASIC development becomes cheap enough? A technology like structured ASICs (write-once FPGAs) offers higher performance than FPGAs at lower cost than ASICs. The technologies are available; the question is how we can translate our software algorithms into hardware.
In terms of opportunities, studying hardware algorithms and translation mechanisms for our workloads is the most straightforward topic. Another topic is the pre-processing of data on its way from physics experiments to HPC systems, where data movement is the biggest issue. How can we manage data in a holistic view (real-time lossless compression at the edge, lossy compression at HPC, encryption, etc.)? There could also be opportunities for real-time AI inference in or near the detector. Lastly, studying hardware tools (e.g., place and route) and design tools such as Chisel themselves is also an interesting topic (dataflow, functional programming, scalability, optimization algorithms, etc.).
New research ideas require an instrument where they can be developed, tested, and shared. To support Computer Science research, such an instrument has to provide access to a diversity of hardware configurations, support deployment at scale, and offer deep reconfigurability so that a wide range of experiments can be supported. It also has to provide mechanisms for sharing so that new results can multiply their value by triggering further innovation. Most importantly, since science does not stand still, such an instrument requires constant adaptation to support an ever-increasing range of experiments driven by emergent ideas and opportunities.
The NSF-funded deeply reconfigurable Chameleon testbed for Computer Science research and education (www.chameleoncloud.org) has been developed to provide all of those capabilities. The testbed provides many thousands of cores and over 5 PB of storage hosted at sites at the University of Chicago and the Texas Advanced Computing Center (TACC), connected by a 100 Gbps network. The hardware consists of a large homogeneous partition to facilitate experiments at scale, along with an investment in diversity consisting of a range of accelerators, storage hierarchy nodes with a mix of HDDs, SSDs, NVMe, and large RAM, high-bandwidth I/O storage, SDN-enabled networking hardware, and fast interconnects. To support Computer Science experiments, including operating system and virtualization research, Chameleon provides a configuration system giving users full control of the software stack: provisioning of bare metal, reboot, power on/off, and console access. To date, the testbed has supported over 5,000 users and almost 700 research and education projects, and it has just been renewed for four more years of operation.
This talk will provide an update on the design strategy, and the existing and future capabilities of the testbed, as well as some of the research and education projects our users are working on. I will also introduce the services and tools we created to support sharing of experiments, curricula, and other digitally expressed artifacts that allow science to be shared via active involvement and foster reproducibility. Finally, I will describe how Chameleon is adapting to support new research directions and hope to discuss how we can most effectively move to support future research together.
Some open questions: what capabilities/hardware/scientific instrument features are needed to support the research of JLESC participants? How can we best collaborate with the ecosystem of testbeds growing out of the original Grid5000 testbed? What new research directions should we support?
The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. With the increasing amount of data processed by today's applications, however, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail.
The current trend towards task-based programming is an opportunity to revisit the principles of the checkpoint/restart strategies. We propose a checkpointing scheme which is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus allowing for very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can lead to a reasonable additional network load, measured to be lower than +10% on a dense linear algebra example.
This work is a use case of ULFM. We are open to collaborations to improve our work, and we can also support application developers who want to try the StarPU runtime. Tapioca seems to be an interesting way to implement a multi-level checkpointing policy, which is lacking in our current work.
We developed a parallel library to accelerate programs where multiple parallel workers update a shared resource, and sometimes concurrently try to update the same location. If the conflict rate is small, our library outperforms many alternatives. This can be used for graph algorithms, sparse matrix operations, back-propagation, etc.
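The sketch below is a generic illustration (not the library's API) of the optimistic, compare-and-swap style of concurrent updates that such a library accelerates: rare conflicts cost a retry rather than a lock.

```c
/* Generic sketch of optimistic concurrent updates to a shared array using
 * C11 atomics: a double payload is packed into a 64-bit word and updated
 * with a compare-and-swap retry loop. */
#include <stdatomic.h>
#include <string.h>

typedef _Atomic(unsigned long long) shared_cell;

static void atomic_add_double(shared_cell *cell, double delta)
{
    unsigned long long old_bits, new_bits;
    double old_val, new_val;
    old_bits = atomic_load(cell);
    do {                                    /* retry only on a real conflict */
        memcpy(&old_val, &old_bits, sizeof old_val);
        new_val = old_val + delta;
        memcpy(&new_bits, &new_val, sizeof new_bits);
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
}
```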
While our initial idea appears promising and was published already, the library can be extended with new acceleration strategies, implementations for other hardware (we currently only deal with multi-core CPUs, not GPUs etc), and we are looking for more applications that could benefit from our work.
Neural network (NN) models are increasingly utilized by scientific simulations to extract knowledge from datasets. Tailoring NNs to specific scientific datasets is time-consuming and error-prone. Neural architecture search (NAS) automates the design of NN architectures. NAS attempts to find well-performing NN models for specialized datasets, where performance is measured by key metrics that capture NN capabilities. However, existing NAS methods are resource intensive, especially when searching for highly accurate models for large datasets. To address this problem, we present PEng4NN, a performance estimation engine that predicts neural network performance early in training. PEng4NN plugs into existing neural architecture search (NAS) methods; it predicts the final accuracy of the NNs that the NAS selects from the search space and reports these predictions to the NAS, enabling early training termination of the NNs in the NAS. PEng4NN has exhibited on average savings of 60% - 80% of training epochs needed in the NAS, which enables a throughput gain of 2.5 to 5 times. This enables the NAS to use fewer resources, explore more architectures, or explore a larger search space.
A. Keller Rorabaugh, S. Caino-Lores, M. R. Wyatt II, T. Johnston, M. Taufer, 'PEng4NN: An Accurate Performance Estimation Engine for Efficient Automated Neural Network Architecture Search,' https://arxiv.org/abs/2101.04185, 2021.
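To convey the flavor of early performance estimation (this is a toy stand-in, not PEng4NN's actual predictive model), the sketch below fits a straight line to accuracy versus 1/epoch over the first few epochs; the intercept estimates the accuracy plateau and could be reported back to a NAS to decide whether to keep training.

```c
/* Toy stand-in for early accuracy prediction: least-squares fit of accuracy
 * against x = 1/epoch; the intercept estimates the accuracy plateau. */
#include <stdio.h>

static double predict_plateau(const double *acc, int epochs)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int t = 1; t <= epochs; ++t) {
        double x = 1.0 / t, y = acc[t - 1];
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double n = (double)epochs;
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return (sy - slope * sx) / n;            /* intercept = predicted plateau */
}

int main(void)
{
    /* accuracies observed for the first 5 epochs of a candidate network */
    double acc[] = {0.42, 0.58, 0.66, 0.70, 0.73};
    printf("predicted final accuracy ~ %.3f\n", predict_plateau(acc, 5));
    return 0;
}
```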
We seek collaborations with researchers in need of NNs tailored to their scientific datasets, where a NAS augmented by PEng4NN can be utilized.
We want to test PEng4NN on a broad range of NAS applications. To that end, we seek collaboration with any NAS developers interested in increasing the efficiency of their NAS.
Currently, PEng4NN develops a predictive model for accuracy. In the future, we plan to extend PEng4NN to predict loss as well and would be pleased to collaborate with researchers who are interested in modeling loss curves.
A protein’s structure determines its function. Different proteins have different structures; proteins in the same family share similar substructures and thus may share similar functions. Additionally, one protein may exhibit several structural states, also named conformations. Identifying different proteins and their conformations can help solve problems such as determining the cause of diseases and designing drugs. X-ray Free Electron Laser (XFEL) beams are used to create diffraction patterns (images) that can reveal protein structure and function. The translation from diffraction patterns in the XFEL images to protein structures and functionalities is nontrivial.
We present our XPSI (XFEL-based Protein Structure Identifier) framework, which relies on ML methods (an autoencoder and kNN) to capture key information that allows the identification of properties such as spatial orientation and protein conformation from the diffraction patterns. In our previous talk, we explored data at two resolutions for one protein with two conformations. For this edition, we show the expansion of our framework to more complex protein diffraction imaging datasets: three imaging resolutions, two proteins with four conformations each, and the addition of a third rotational angle for spatial orientation. We quantify the classification accuracy and performance of XPSI, obtaining an orientation prediction error of up to 10 degrees and a two-conformation prediction accuracy of 95%.
The addition of imaging factors in our experimental setup, such as resolution, rotation, and symmetric proteins, increases the complexity of our data and prediction task. Therefore, employing new techniques such as VAEs (Variational autoencoders) or end-to-end deep learning frameworks are promising next steps for our framework.
This project is collaborative research between RIKEN, GCLab, and ICL.
What is the suitability of the current framework for real-world cases (e.g., 3-D reconstruction of proteins in general and the ribosome in particular, annotation of images for their classification without human intervention)?
What is the range of errors (i.e., distribution of orientation and/or conformation) tolerable for scientists?
What are the costs in terms of computational requirements (execution time and resources) for other methods that identify proteins’ properties used by the scientists and how does our framework compare?
Can we use systematic data from simulations to train the framework and use the trained models for classification/prediction of real experiment data successfully?
How can we augment the framework to also classify/predict protein structures (currently the framework captures orientations and conformations)?
What aspects from the science, such as, rotational orientation, beam resolution, symmetrical proteins, can presumably limit the functionality of the framework?
We present a framework to natively execute CUDA code on CPUs. Porting existing GPU codes to Fugaku is tedious, especially when it comes to DL frameworks (e.g., TensorFlow and PyTorch). Hence, we explore an alternative approach: we present the application with a virtual GPU and re-implementations of the CUDA libraries for CPU in order to execute such CUDA applications natively. We demonstrate a proof of concept of our idea and show early results of running ResNet-50 with PyTorch.
Replicating the CUDA runtime and CUDA libraries is doable; however, handwritten CUDA kernels pose a problem for the MocCUDA approach. We seek collaboration with CUDA experts and other interested parties to design automatic CUDA2CPU translation tools.
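The hand-written illustration below (not MocCUDA code) shows what translating a simple CUDA kernel launch to the CPU boils down to: replacing the grid/block launch by loops over block and thread indices.

```c
/* Illustration of executing a simple CUDA kernel "natively on the CPU".
 *
 * CUDA original (for reference):
 *   __global__ void scale(float *x, float a, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) x[i] *= a;
 *   }
 *   scale<<<(n + 255) / 256, 256>>>(x, a, n);
 */
static void scale_cpu(float *x, float a, int n)
{
    const int block_dim = 256;
    const int grid_dim  = (n + block_dim - 1) / block_dim;
    for (int block = 0; block < grid_dim; ++block)            /* blockIdx.x  */
        for (int thread = 0; thread < block_dim; ++thread) {  /* threadIdx.x */
            int i = block * block_dim + thread;
            if (i < n)
                x[i] *= a;
        }
}
```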
The new Braid project at ANL seeks to develop new methods for processing data from next-generation science experiments efficiently and reliably in a wide range of computing environments. When coupling machine learning with experimental data, tracking progress and validating queries becomes very difficult. We are developing a new provenance structure to capture the complex, versioned, and recursive structure of the input data that affects machine learning-based predictions. We will present the data model and architecture of this system in the context of Braid and describe four application engagements.
We have developed a prototype data model and implementation for Braid DB but invite discussion about other use cases, approaches, and experiences.
Distributed digital infrastructures for computation and analytics are now evolving towards an interconnected ecosystem allowing complex applications to be executed from IoT Edge devices to the HPC Cloud (aka the Computing Continuum, the Digital Continuum, or the Transcontinuum). Understanding end-to-end performance in such a complex continuum is challenging. In this presentation we will introduce E2Clab, a framework that implements a rigorous methodology that supports reproducibility and allows one to deploy real-life (e.g., Pl@ntNet botanical system) application workflows in the context of the Edge-to-Cloud Continuum on large-scale testbeds (e.g., Grid’5000). We will also highlight research challenges on optimizing the end-to-end deployment (e.g., minimizing processing latency, energy consumption, financial costs, etc.) of application workflows on distributed and heterogeneous resources in the Computing Continuum. The collaboration opportunities are in combining E2Clab + DeepHyper (ANL) to optimize the end-to-end application deployment.
Real-world applications deployed on the Computing Continuum (e.g., smart factory, autonomous vehicles, among others) typically need to comply with many constraints related to resource consumption (e.g., GPU, CPU, memory, storage and bandwidth capacities), the software components composing the application and requirements such as QoS, security, and privacy. Furthermore, optimizing the end-to-end mapping and deployment (e.g., minimizing processing latency, energy consumption, financial costs, etc.) of application workflows on such distributed and heterogeneous resources is challenging. The parameter settings of the applications and the underlying infrastructure result in a myriad of configuration possibilities and, consequently, in a complex multi-infrastructure configuration search space. The intricacies of these configurations require an analysis in a controlled testbed environment in order to understand their performance trade-offs (i.e., latency and energy consumption, throughput and resource usage, cost and service availability, etc.), prior to the widespread deployment of such applications in production environments. The proposed approach aims to answer questions like: How should the system components distributed on Edge, Fog, and Cloud infrastructures be configured to minimize processing latency? Where should application parts be executed to minimize communication costs and end-to-end latency?
Collaboration opportunities: The idea is to apply DeepHyper (ANL) in a new domain (e.g., application workflows in the Computing Continuum). DeepHyper may be used to search the hyperparameters of models generated by E2Clab and then search the application configurations (explore the search space).
In this talk it will be shown how JSC uses numerical methods combined with machine learning (ML) approaches to improve diagnoses of pathologies in the human respiratory system. First, a data pipeline that uses computed tomography recordings of the human upper airway as input and automatically provides results of numerical flow simulations as output is presented. The results are analyzed to estimate the respiration capabilities, e.g., to evaluate the impact of narrowed channels on the supply of air for the lungs or how efficiently the anatomy can heat up the air. The segmentation of air from other matter to define the region of interest and the detection of inflow and outflow boundaries that are required to set up a simulation make use of convolutional neural networks (CNNs). The automation of the pipeline now allows us to conduct a large number of simulations. The corresponding results will be used for data-driven techniques that accelerate and improve diagnoses. That is, CNNs will be used to intelligently initialize the flow field, yielding simulation results faster and stabilizing the simulation. Furthermore, CNNs will be used to localize and classify pathologies, and a reinforcement learning (RL) algorithm will be trained to modify the surface of the upper airway in a way that fluid mechanical properties are optimized. The algorithm will then be capable of suggesting steps for surgical interventions.
Open questions and collaboration opportunities refer to three topics. First, it would be helpful to have a discussion with partners that have experience in initializing flow fields with the help of ML techniques. Second, a discussion about the number of CT recordings needed for localizing and classifying pathologies is sought. Third, it shall be discussed whether the RL algorithm works better with volumetric data or surface data (vertices and edges) as input to the neural networks used to estimate the environment. A general exchange with experts on the usage of ML in the field of CFD is welcome.
In this work, we propose a novel memory-driven high performance DNN training framework that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger neural networks. Different from the state-of-the-art solutions that adopt image-based lossy compressors such as JPEG to compress the activation data, our framework purposely designs error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we provide theoretical analysis on the compression error propagation from the altered activation data to the gradients, and then empirically investigate the impact of altered gradients over the entire training process. Based on these analyses, we then propose an improved lossy compressor and an adaptive scheme to dynamically configure the lossy compression error-bound and adjust the training batch size to further utilize the saved memory space for additional speedup. We evaluate our design against state-of-the-art solutions with four widely-adopted DNNs and the ImageNet dataset.
Looking for more collaboration opportunities for memory lossy compression in deep learning and HPC applications
In this work, we will present our proposed optimization techniques for the end-to-end performance of the error-bounded lossy compressor cuSZ on GPU-based HPC systems.
Looking for real-world large-scale HPC applications that require GPU-based lossy compression to reduce the I/O and storage burden
Scientific computation increasingly consists of a workflow of interrelated tasks. Containerization can make workflow systems more manageable, reproducible, and portable, but containers can impede communication due to their focus on encapsulation. In some circumstances, shared-memory regions are an effective way to improve the performance of workflows; however, sharing memory between containerized workflow tasks is difficult. Recently, we have created a software library called Dhmem that manages shared memory between workflow tasks in separate containers, with minimal code change and performance overhead. Instead of all code being in the same container, Dhmem allows a separate container for each workflow task to be constructed completely independently. Dhmem enables additional functionality: easy integration in existing workflow systems, communication configuration at runtime based on the environment, and scalable performance. In this talk, we present our Dhmem library and showcase some areas where it has been shown to improve performance, as well as some future opportunities enabled by the use of Dhmem. We hope that this talk will set up potential collaborations in the future.
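For context, the bare POSIX shared-memory exchange below illustrates the mechanism such a library builds on; it is not the Dhmem API. For two containers this assumes they see the same /dev/shm (e.g. a shared IPC namespace or a common bind mount), which is a deployment assumption.

```c
/* Bare POSIX shared-memory exchange between two processes (not the Dhmem
 * API). Run one process with the argument "writer" and the other without.
 * Link with -lrt on some systems. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION "/jlesc_demo"
#define SIZE   4096

int main(int argc, char **argv)
{
    int writer = (argc > 1 && strcmp(argv[1], "writer") == 0);
    int fd = shm_open(REGION, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SIZE) != 0) return 1;

    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) return 1;

    if (writer) {
        snprintf(buf, SIZE, "payload produced by task A");
    } else {
        sleep(1);                        /* crude ordering for the demo */
        printf("task B read: %s\n", buf);
        shm_unlink(REGION);              /* clean up the region */
    }
    munmap(buf, SIZE);
    close(fd);
    return 0;
}
```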
Open questions include: how can Dhmem be used in situations where workflow tasks have very different lifetimes? How can Dhmem enable automatic and transparent access to data produced by the workflow? How can more workflow systems be supported by and integrated using Dhmem? Orcun Yildiz will be leading potential collaboration projects.
Earth system models (ESMs) have increased their spatial resolution to achieve more accurate solutions. As a consequence, the number of grid points increases dramatically, so an enormous amount of data is produced as simulation results. In addition, if ESMs manage to take advantage of the upcoming exascale computing power, their current data management systems will become a bottleneck as data production grows exponentially.
The XML Input/Output Server (XIOS) is an MPI parallel I/O server designed for ESMs to efficiently post-process data inline as well as read and write data in NetCDF4 format. Although it offers good performance in terms of computational efficiency at current resolutions, this could change for larger resolutions since XIOS performance is very dependent on the output size. To address this problem we test HDF5 compression in order to reduce the size of the data so that both I/O time and storage footprint can be improved. However, the default lossless compression filter of HDF5 does not provide a good trade-off between size reduction and computational cost.
Alternatively, we consider using lossy compression filters that may allow reaching high compression ratios and enough compression speed to considerably reduce the I/O time while keeping high accuracy. In particular, we are exploring the feasibility of using the SZ lossy compressor developed by the Argonne National Laboratory (ANL) to write highly compressed NetCDF files through XIOS. As a case study, the Open Integrated Forecast System (OpenIFS) is used, a weather forecast model that can use XIOS to output data.
Continuing the collaboration between ANL and BSC started at the last JLESC workshop.
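As a concrete reference point for the baseline discussed above, the sketch below enables HDF5's built-in deflate filter on a chunked dataset; a third-party lossy filter such as SZ would instead be registered and selected through H5Pset_filter with its registered filter ID (not shown here).

```c
/* Sketch: enable HDF5's built-in lossless (deflate) filter on a chunked
 * dataset. Compile with h5cc. */
#include <hdf5.h>

int main(void)
{
    hsize_t dims[2]  = {1024, 1024};
    hsize_t chunk[2] = {128, 128};

    hid_t file  = H5Fcreate("field.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);     /* filters require a chunked layout */
    H5Pset_deflate(dcpl, 6);          /* gzip compression, level 6 */

    hid_t dset = H5Dcreate2(file, "temperature", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* ... H5Dwrite(dset, H5T_NATIVE_FLOAT, ...) with the field data ... */

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```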
SZ, an error-bounded lossy compressor for scientific datasets, has been under development for five years and has accumulated many success stories. SZ has been used in many different use cases, including reducing memory footprint, reducing storage footprint, accelerating computation, and improving I/O performance. In this talk, I will summarize the recent success stories about SZ from the past two years, which involve multiple scientific applications across different scientific domains, such as seismic wave simulation (reverse time migration), quantum circuit simulation, and quantum chemistry simulation. I will also present SZ's latest progress and project potential collaboration opportunities.
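For readers new to error-bounded lossy compression, the much-simplified sketch below illustrates the core predict-then-quantize idea used by compressors of this kind; it is not SZ's actual implementation, which adds multi-dimensional predictors, entropy coding and more.

```c
/* Much-simplified illustration of error-bounded prediction + quantization:
 * every reconstructed value differs from the original by at most 'eb'. */
#include <math.h>
#include <stdlib.h>

/* Returns quantization codes; reconstructed values are written back to data
 * so that each prediction uses exactly what the decompressor will also see. */
int *compress_1d(double *data, int n, double eb)
{
    int *codes = malloc(n * sizeof *codes);
    double prev = 0.0;                        /* last reconstructed value */
    for (int i = 0; i < n; ++i) {
        double pred = prev;                   /* 1D "previous value" predictor */
        int q = (int)round((data[i] - pred) / (2.0 * eb));
        double recon = pred + 2.0 * eb * q;   /* |recon - data[i]| <= eb */
        codes[i] = q;
        data[i] = recon;
        prev = recon;
    }
    return codes;                             /* codes are then entropy-coded */
}
```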
Open questions: how to use lossy compressors in applications (what are the specific requests from users)? How to assess the impact of lossy compressors on post hoc analysis?
Collaboration opportunities: the SZ team will be more than happy to work together on using or improving lossy compression for specific applications or various use cases.
Large-scale applications and workflows in scientific domains such as weather forecasting or cosmology have increasing I/O needs. However, storage has been designed for decades as a global shared resource, making I/O one of the main bottlenecks of HPC systems. To mitigate this congestion, new tiers (node-local SSDs, burst buffers, network-attached storage, and so on) have been added to recently deployed supercomputers, increasing their complexity. Harnessing this additional storage capacity is an active research topic, but little has been done about how to efficiently provision it.
Nevertheless, while for years high-performance computing (HPC) systems were the predominant means of meeting the requirements expressed by large-scale scientific workflows, today some components have moved away from supercomputers to Cloud-type infrastructures. From an I/O and storage perspective, the world of Cloud computing is very different from on-premise supercomputers: direct access to resources is extremely limited due to a very high level of abstraction.
Dealing with this high degree of heterogeneity distributed between two worlds with very different philosophies is a real challenge for scientific workflows and applications. In this project, we propose to explore ways to enable workflows to seamlessly use elastic storage systems on hybrid infrastructures combining HPC systems and Cloud. In particular, we want to focus on storage resource provisioning and on the questions it raises in terms of abstraction models and scheduling algorithms. With this talk, we provide an overview of our current research and potential collaborations within the JLESC.
We are open to collaborations on different topics, from scheduling techniques adapted to storage resources to hybrid HPC/Cloud storage abstractions. We now have an active project on Chameleon for that purpose, which will allow us to better understand Cloud storage and run experiments.
The Performance API (PAPI) enables programmers to monitor the performance of their applications via inspection of native hardware events. The instrumentation of a user’s code with PAPI requires the usage of event names. However, for meaningful performance monitoring, the user must know what event is indexed by a given event name.
The Counter Analysis Toolkit (CAT) contains a suite of micro-benchmarks, which systematically measure the occurrences of events by stressing different types of performance hardware. Using the event occurrence-pattern data, we aim to identify events pertaining to particular programming concepts of interest. In this talk, we examine the role of the discrete least squares regression in solving this problem.
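For readers unfamiliar with PAPI, the minimal example below shows the kind of named-event instrumentation whose meaning CAT helps establish; the event name used here, PAPI_TOT_INS, is a standard PAPI preset.

```c
/* Minimal PAPI usage: count total instructions over a region of interest. */
#include <papi.h>
#include <stdio.h>

int main(void)
{
    long long counts[1];
    int evset = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_named_event(evset, "PAPI_TOT_INS");   /* total instructions */

    PAPI_start(evset);
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x += 1.0;    /* region of interest */
    PAPI_stop(evset, counts);

    printf("PAPI_TOT_INS = %lld\n", counts[0]);
    return 0;
}
```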
The open questions for this talk are as follows:
- How can this data analysis be made more robust?
- What are other data analysis techniques which can be applied post-regression?
The EuroCC project, with over 30 partners, is funded by the EuroHPC Joint Undertaking as part of the EU framework programme for research and innovation, Horizon 2020. It aims at bringing the participating countries to a common high level in all HPC-related areas. To this end, National Competence Centres are established that map the competences of each partner country in HPC, HPDA, and AI. Besides filling identified gaps, the vision of the EuroCC project is to make HPC and its related fields more accessible not only to users from science, but especially to industry. This will be achieved by offering a portfolio of services tailored to the needs of the users, such as training and consulting.
The Jülich Supercomputing Centre (JSC) contributes to EuroCC as a member of the German National Competence Centre. More precisely, JSC is extending its Industry Relations Team that aims at establishing new industrial collaborations based on both commercial and third-party funded relations.
This talk will first give a brief overview of the EuroCC project in general and subsequently of the activities of JSC's Industry Relations Team.
General thoughts on the EuroCC project, and a discussion of experiences with industry cooperation (offered services, what defines success from the industry perspective and from the HPC center perspective, how to identify partners that could benefit from HPC, especially SMEs).
We are currently developing tools to measure the amount and location of active memory pages. The analysis results shall help with decisions on whether moving certain hot allocations to HBM would be beneficial. We use the mainline Linux kernel feature CONFIG_IDLE_PAGE_TRACKING (a non-default option).
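The hedged sketch below shows the bare mechanism behind this kernel feature: translate a virtual address to a page frame number via /proc/self/pagemap, mark the frame idle in /sys/kernel/mm/page_idle/bitmap, and read the bit back later; if it was cleared, the page was touched in between. The bit layouts follow the kernel documentation, root privileges are required, and error handling is trimmed for brevity; this is not the group's tool.

```c
/* Hedged sketch of using CONFIG_IDLE_PAGE_TRACKING (needs root). */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t vaddr_to_pfn(void *addr)
{
    int fd = open("/proc/self/pagemap", O_RDONLY);
    uint64_t entry = 0;
    off_t off = ((uintptr_t)addr / sysconf(_SC_PAGESIZE)) * 8;
    pread(fd, &entry, sizeof entry, off);
    close(fd);
    return (entry >> 63) ? (entry & ((1ULL << 55) - 1)) : 0; /* PFN if present */
}

static void set_idle(uint64_t pfn)
{
    int fd = open("/sys/kernel/mm/page_idle/bitmap", O_WRONLY);
    uint64_t word = 1ULL << (pfn % 64);       /* only set bits mark pages idle */
    pwrite(fd, &word, sizeof word, (pfn / 64) * 8);
    close(fd);
}

static int still_idle(uint64_t pfn)           /* 0 means the page was accessed */
{
    int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDONLY);
    uint64_t word = 0;
    pread(fd, &word, sizeof word, (pfn / 64) * 8);
    close(fd);
    return (word >> (pfn % 64)) & 1;
}
```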
The sampling rate is currently limited by the number of kernel calls needed to acquire information about all mapped pages of an application. Are there ways to improve the measurement resolution?
Can other groups use this for deriving interesting properties of their applications, too?
Matching these measurements with other performance counters, execution traces, ... by integration with existing tools would be a collaboration opportunity.
The generalized minimal residual method (GMRES) is a commonly used solver for sparse, non-symmetric systems of linear equations. As a Krylov method, its runtime is dominated by data movement. So, we look at a selective reduction in precision designed to improve performance by reducing data movement while still achieving a full-precision solution.
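The tiny sketch below is not the GMRES variant from the talk, just the underlying idea: store bulk vector data in single precision to roughly halve memory traffic while accumulating reductions in double precision.

```c
/* Mixed-precision idea in miniature: float storage, double accumulation. */
#include <stddef.h>

double dot_mixed(const float *x, const float *y, size_t n)
{
    double acc = 0.0;                 /* full-precision accumulator */
    for (size_t i = 0; i < n; ++i)
        acc += (double)x[i] * (double)y[i];
    return acc;
}
```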
Open questions include:
* Can this approach be transferred to distributed systems, other GMRES variants and other Krylov solvers?
* Are there other data-reduction techniques that can be used effectively?
Additionally, there may be collaboration opportunities in finding applications that benefit from mixed-precision GMRES.
Low-rank matrix approximation plays an important role in data analysis and scientific computing. With the increase of data volumes, scalability becomes increasingly important, pushing for a rethinking of classical approaches to data handling. In this brief talk, we discuss some of the recently proposed randomized algorithms that aim to provide a scalable solution to low-rank matrix approximation problems while maintaining reasonable numerical performance. We aim to show that randomization does not always imply uncertainty, but rather presents a favorable tradeoff, exchanging a small amount of accuracy for a significant increase in performance.
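For concreteness, the plain-C sketch below implements a textbook randomized range finder (in the style of Halko et al.), one of the building blocks such algorithms rely on: sample Y = A * Omega with a random Omega, orthonormalize Y to obtain Q, and A is then approximated by Q * (Q^T A). It is dense, unblocked, and purely illustrative; a production version would use BLAS/LAPACK and Gaussian sampling.

```c
/* Textbook randomized range finder sketch: Q (m x k) spans an approximate
 * range of A (m x n, row-major). */
#include <math.h>
#include <stdlib.h>

void randomized_range(const double *A, int m, int n, int k, double *Q)
{
    /* Y = A * Omega, with Omega filled with uniform random entries. */
    double *Omega = malloc((size_t)n * k * sizeof *Omega);
    for (int i = 0; i < n * k; ++i)
        Omega[i] = 2.0 * rand() / RAND_MAX - 1.0;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j) {
            double s = 0.0;
            for (int l = 0; l < n; ++l)
                s += A[i * n + l] * Omega[l * k + j];
            Q[i * k + j] = s;
        }
    /* Orthonormalize the columns of Y in place (modified Gram-Schmidt). */
    for (int j = 0; j < k; ++j) {
        for (int p = 0; p < j; ++p) {
            double proj = 0.0;
            for (int i = 0; i < m; ++i) proj += Q[i * k + p] * Q[i * k + j];
            for (int i = 0; i < m; ++i) Q[i * k + j] -= proj * Q[i * k + p];
        }
        double nrm = 0.0;
        for (int i = 0; i < m; ++i) nrm += Q[i * k + j] * Q[i * k + j];
        nrm = sqrt(nrm);
        for (int i = 0; i < m; ++i) Q[i * k + j] /= nrm;
    }
    free(Omega);
}
```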
Produce a formal set of notes about modern practical randomized algorithms. Provide templates for each algorithm, for a general user to be aware of the underlying principles of methods (Similar to 'Barrett, R., M.W. Berry, T.F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romaine, and H. van der Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, SIAM, Philadelphia, 1994.').
Produce a high-quality randomized algorithms library to be proposed to application domain experts and for 'state of the art' purposes.
The organizers of the 12th JLESC Workshop are dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, religion (or lack thereof), technology choices, or other group status.
To make clear what is expected, everyone taking part in the event - speakers, helpers, organizers, and participants - is required to conform to the Berlin Code of Conduct. The full text of the Code of Conduct can be found at http://berlincodeofconduct.org/.
To give a brief overview here, you are expected to:
The following behavior is unacceptable: intimidating, harassing, abusive, discriminatory, derogatory or demeaning speech or actions by any participant in our community online, at all related events and in one-on-one communications carried out in the context of community business.
Harassment includes harmful or prejudicial verbal or written comments related to gender, sexual orientation, race, religion, disability; inappropriate use of nudity and/or sexual images (including presentation slides); inappropriate depictions of violence (including presentation slides); deliberate intimidation, stalking or following; harassing photography or recording; sustained disruption of talks or other events.
If you witness or are subject to unacceptable behavior, please contact one of the workshop organizers via email or Slack. You can do so anonymously.