Publications
“Scalable Dense Linear Algebra on Heterogeneous Hardware,”
HPC: Transition Towards Exascale Processing, in the series Advances in Parallel Computing, 2013.
“System Software for Many-Core and Multi-Core Architectures,”
Advanced Software Technologies for Post-Peta Scale Computing: The Japanese Post-Peta CREST Research Project, Singapore, Springer Singapore, pp. 59–75, 2019.
DOI: 10.1007/978-981-13-1924-2_4
“Assessing the Impact of ABFT and Checkpoint Composite Strategies,”
16th Workshop on Advances in Parallel and Distributed Computational Models, IPDPS 2014, Phoenix, AZ, IEEE, May 2014.
“Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications,”
Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'19), November 2019.
“Do moldable applications perform better on failure-prone HPC platforms?,”
11th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids, Turin, Italy, Springer Verlag, August 2018.
“Efficient Parallelization of Batch Pattern Training Algorithm on Many-core and Cluster Architectures,”
7th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems, Berlin, Germany, September 2013.
“Flexible Data Redistribution in a Task-Based Runtime System,”
IEEE International Conference on Cluster Computing (Cluster 2020), Kobe, Japan, IEEE, September 2020.
DOI: 10.1109/CLUSTER49012.2020.00032
“From Serial Loops to Parallel Execution on Distributed Systems,”
International European Conference on Parallel and Distributed Computing (Euro-Par '12), Rhodes, Greece, August 2012.
“Hierarchical DAG scheduling for Hybrid Distributed Systems,”
29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Hyderabad, India, IEEE, May 2015.
“Multi-criteria Checkpointing Strategies: Response-Time versus Resource Utilization,”
Euro-Par 2013, Aachen, Germany, Springer, August 2013.
“A Multithreaded Communication Substrate for OpenSHMEM,”
8th International Conference on Partitioned Global Address Space Programming Models (PGAS), Eugene, OR, October 2014.
“Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms,”
2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Best Paper Award, Vancouver, BC, Canada, IEEE, May 2018.
DOI: 10.1109/IPDPSW.2018.00127
“Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery,”
22nd European MPI Users' Group Meeting, Bordeaux, France, ACM, September 2015.
DOI: 10.1145/2802658.2802668
“Practical Scalable Consensus for Pseudo-Synchronous Distributed Systems,”
The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, ACM, November 2015.
“PTG: An Abstraction for Unhindered Parallelism,”
International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC), New Orleans, LA, IEEE Press, November 2014.
“Reasons for a Pessimistic or Optimistic Message Logging Protocol in MPI Uncoordinated Failure Recovery,”
CLUSTER '09, New Orleans, IEEE, August 2009.
DOI: 10.1109/CLUSTR.2009.5289157
“Runtime Level Failure Detection and Propagation in HPC Systems,”
European MPI Users' Group Meeting (EuroMPI '19), Zürich, Switzerland, ACM, September 2019.
DOI: 10.1145/3343211.3343225
“Algorithm-Based Fault Tolerance for Dense Matrix Factorization,”
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, ACM, pp. 225-234, February 2012.
DOI: 10.1145/2145816.2145845
“A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI,”
18th International European Conference on Parallel and Distributed Computing (Euro-Par 2012) (Best Paper Award), Rhodes, Greece, Springer-Verlag, August 2012.
“Correlated Set Coordination in Fault Tolerant Message Logging Protocols,”
Proceedings of 17th International Conference, Euro-Par 2011, Part II, vol. 6853, Bordeaux, France, Springer, pp. 51-64, August 2011.
“DAGuE: A Generic Distributed DAG Engine for High Performance Computing,”
Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1151-1158, 2011.
“Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols,”
Proceedings of EuroMPI 2010, Stuttgart, Germany, Springer, September 2010.
“Evaluating Contexts in OpenSHMEM-X Reference Implementation,”
OpenSHMEM and Related Technologies. Big Compute and Big Data Convergence, Cham, Springer International Publishing, pp. 50–62, 2018.
DOI: 10.1007/978-3-319-73814-7_4
“An Evaluation of User-Level Failure Mitigation Support in MPI,”
Proceedings of Recent Advances in Message Passing Interface - 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, Springer, September 2012.
“Failure Detection and Propagation in HPC Systems,”
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), Salt Lake City, Utah, IEEE Press, pp. 27:1-27:11, November 2016.
“Fault Tolerance Management for a Hierarchical GridRPC Middleware,”
8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), Lyon, France, January 2008.
“Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA,”
Proceedings of the Workshops of the 25th IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2011 Workshops), Anchorage, Alaska, USA, IEEE, pp. 1432-1441, May 2011.
“From MPI to OpenSHMEM: Porting LAMMPS,”
OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies, Annapolis, MD, USA, Springer International Publishing, pp. 121–137, 2015.
DOI: 10.1007/978-3-319-26428-8_8
“Kernel Assisted Collective Intra-node MPI Communication Among Multi-core and Many-core CPUs,”
Int'l Conference on Parallel Processing (ICPP '11), Taipei, Taiwan, September 2011.
“Locality and Topology aware Intra-node Communication Among Multicore CPUs,”
Proceedings of the 17th EuroMPI conference, Stuttgart, Germany, LNCS, September 2010.
“PMIx: Process Management for Exascale Environments,”
Proceedings of the 24th European MPI Users' Group Meeting, New York, NY, USA, ACM, pp. 14:1–14:10, 2017.
DOI: 10.1145/3127024.3127027
“Redesigning the Message Logging Model for High Performance,”
International Supercomputer Conference (ISC 2008), Dresden, Germany, January 2008.
“Revisiting Credit Distribution Algorithms for Distributed Termination Detection,”
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp. 611–620, 2021.
DOI: 10.1109/IPDPSW52791.2021.00095
“Surviving Errors with OpenSHMEM,”
OpenSHMEM and Related Technologies. Enhancing OpenSHMEM for Hybrid Environments, Baltimore, MD, USA, Springer International Publishing, pp. 66–81, 2016.
“UCX: An Open Source Framework for HPC Network APIs and Beyond,”
2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, Santa Clara, CA, USA, IEEE, pp. 40-43, 2015.
DOI: 10.1109/HOTI.2015.13
“A Unified HPC Environment for Hybrid Manycore/GPU Distributed Systems,”
IEEE International Parallel and Distributed Processing Symposium (submitted), Anchorage, AK, May 2011.
“Algorithm-based Fault Tolerance for Dense Matrix Factorizations, Multiple Failures, and Accuracy,”
ACM Transactions on Parallel Computing, vol. 1, issue 2, no. 10, pp. 10:1-10:28, January 2015.
DOI: 10.1145/2686892
“Checkpointing Strategies for Shared High-Performance Computing Platforms,”
International Journal of Networking and Computing, vol. 9, no. 1, pp. 28–52, 2019.
“Comparing the Performance of Rigid, Moldable, and Grid-Shaped Applications on Failure-Prone HPC Platforms,”
Parallel Computing, vol. 85, pp. 1–12, July 2019.
DOI: 10.1016/j.parco.2019.02.002
“Composing Resilience Techniques: ABFT, Periodic, and Incremental Checkpointing,”
International Journal of Networking and Computing, vol. 5, no. 1, pp. 2-15, January 2015.
“Correlated Set Coordination in Fault Tolerant Message Logging Protocols,”
Concurrency and Computation: Practice and Experience, vol. 25, issue 4, pp. 572-585, March 2013.
DOI: 10.1002/cpe.2859
“DAGuE: A generic distributed DAG Engine for High Performance Computing,”
Parallel Computing, vol. 38, no. 1-2, Elsevier, pp. 27-51, 2012.
“Dense Linear Algebra on Distributed Heterogeneous Hardware with a Symbolic DAG Approach,”
Scalable Computing and Communications: Theory and Practice, John Wiley & Sons, pp. 699-735, March 2013.
“An evaluation of User-Level Failure Mitigation support in MPI,”
Computing, vol. 95, issue 12, pp. 1171-1184, December 2013.
DOI: 10.1007/s00607-013-0331-3
“Extending the scope of the Checkpoint-on-Failure protocol for forward recovery in standard MPI,”
Concurrency and Computation: Practice and Experience, July 2013.
DOI: 10.1002/cpe.3100
“A Failure Detector for HPC Platforms,”
The International Journal of High Performance Computing Applications, vol. 32, issue 1, pp. 139–158, January 2018.
DOI: 10.1177/1094342017711505
“Fault Tolerance of MPI Applications in Exascale Systems: The ULFM Solution,”
Future Generation Computer Systems, vol. 106, pp. 467-481, May 2020.
DOI: 10.1016/j.future.2020.01.026
“A Framework for Out of Memory SVD Algorithms,”
ISC High Performance 2017, pp. 158–178, June 2017.
DOI: 10.1007/978-3-319-58667-0_9
“HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters,”
IPDPS 2012 (Best Paper), Shanghai, China, May 2012.
“Impact of Kernel-Assisted MPI Communication over Scientific Applications: CPMD and FFTW,”
18th EuroMPI, Santorini, Greece, Springer, pp. 247-254, September 2011.