Fault-Aware Runtime Strategies for High-Performance Computing