Fault-Aware Runtime Strategies for High-Performance Computing

A Proactive Fault Tolerance Approach to High Performance Computing (HPC) in the Cloud

David Levy

2012 Second International Conference on Cloud and Green Computing, 2012

Extending the MPI specification for process fault tolerance on high performance computing systems

J. Pjesivac-Grbovic

2004

Towards a fault-aware computing environment

Zhiling Lan

2008

Sensitivity of Application Performance to Resource Availability

Ajitha Rajan

Proceedings of the 10th EAI International Conference on Performance Evaluation Methodologies and Tools, 2017

Enhancing application robustness through adaptive fault tolerance

Zhiling Lan

2008 IEEE International Symposium on Parallel and Distributed Processing, 2008

High Performance Dependable Multiprocessor II

Grzegorz Cieslewski

2007 IEEE Aerospace Conference, 2007

A Primer on Architectural Level Fault Tolerance

Ricky Butler

Balancing Performance and Reliability in the Memory Hierarchy

Hossein Asadi

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005., 2005

Performance and Dependability Validation of Highly Parallel Fault-Tolerant Systems

Kishor S Trivedi

1992

Performance issues for multicore processor operating systems

Sabaretnam Sajiharan

2014

Performance/Architecture

Omar Khan

Journal of Architectural Education, 2008

Workload and network-optimized computing systems

Hubertus Franke

IBM Journal of Research and Development, 2010

High Performance Optimization

Shuzhong Zhang

Applied Optimization, 2000

Control of cascading failures using protective measures

Mozhgan Khanjanianpak

Scientific Reports, 2024

Real-time data-intensive computing

Keith Beattie

AIP Conference Proceedings, 2016

State-of-the-Art Technologies for Large-Scale Computing

Chaker El Amrani

Dubitsky/Large-Scale Computing, 2012

Proceedings of the Second HPI Cloud Symposium “Operating the Cloud” 2014

Mohamed Esam Elsaid

2015

Performance-reliability tradeoff analysis for multithreaded applications

Oguz Tosun

2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012

Xception: Software Fault Injection and Monitoring In Processor Functional Units

João Gabriel

Dependable Computing and Fault …, 1998

Critical path selection for performance optimization

David Du

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1993

Understanding Variations for Better Adjusting Parallel Supplemental Redundant Executions to Tolerate Timing Faults

Yukihiro Sasagawa

IEICE Transactions on Information and Systems, 2014

System support for many task computing

Eric Van Hensbergen

2008 Workshop on Many-Task Computing on Grids and Supercomputers, 2008

Combined optimisation of system architecture and maintenance

Yiannis Papadopoulos

2013

A Study of Software Development for High Performance Computing

Salim Hariri

Programming Environments for Massively Parallel Distributed Systems, 1994

High Performance Computing Clouds

Andrzej Goscinski

CRC Press eBooks, 2017

Dependability engineering of complex computing systems

Mohamed Kaaniche

Proceedings Sixth IEEE International Conference on Engineering of Complex Computer Systems. ICECCS 2000, 2000

High speed and large scale scientific computing

Gerhard Joubert

2009

Chameleon: a software infrastructure for adaptive fault tolerance

Saurabh Bagchi

IEEE Transactions on Parallel and Distributed Systems, 1999

Design and operation of large systems

Nam Suh

Journal of Manufacturing Systems, 1995

Software fault tolerance in distributed systems using controlled re-execution

Vijay Garg

2000

Enhancing Dependability Through Flexible Adaptation to Changing Requirements

Luis Fernando Carrillo Andrade

Architecting Dependable Systems II, 2004

Current Trends in Parallel Computing

FIROJ ALI SK

International Journal of Computer Applications, 2012

An Era of Change-Tolerant Systems

Shawn Bohner

IEEE Computer, 2007

On system-wide failures in complex, evolving systems

Paul Ormerod

Performance Improvements at the

Fred D Lang
