Elvis Rojas - Academia.edu

Papers by Elvis Rojas

Understanding failures through the lifetime of a top-level supercomputer

Journal of Parallel and Distributed Computing

High performance computing systems are required to solve grand challenges in many scientific disciplines. These systems assemble many components to be powerful enough for solving extremely complex problems. An inherent consequence is the intricacy of the interaction of all those components, especially when failures come into the picture. It is crucial to develop an understanding of how these systems fail in order to design reliable supercomputing platforms in the future. This paper presents the results of studying multi-year failure and workload records of a powerful supercomputer that topped the world rankings. We provide a thorough analysis of the data and characterize the reliability of the system through several dimensions: failure classification, failure-rate modelling, and the interplay between failures and workload. The results shed some light on the dynamics of top-level supercomputers and sensitive areas ripe for improvement.

Exploring the Effects of Silent Data Corruption in Distributed Deep Learning Training

2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Early Experiences of Noise-Sensitivity Performance Analysis of a Distributed Deep Learning Framework

2022 IEEE International Conference on Cluster Computing (CLUSTER)

A Study of Checkpointing in Large Scale Training of Deep Neural Networks

arXiv (Cornell University), Dec 1, 2020

Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. Checkpoint-restart is a common fault tolerance technique in HPC workloads. In this work, we examine the checkpointing implementations of popular DL platforms. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
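
As a rough illustration of the checkpoint-restart pattern examined in this work, the sketch below periodically saves and restores training state in PyTorch. It is a minimal sketch under stated assumptions, not the benchmark code from the paper; the placeholder model, optimizer, and checkpoint.pt file name are invented for the example.

```python
# Minimal checkpoint-restart sketch in PyTorch (illustrative only; not the paper's benchmark code).
import os
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
ckpt_path = "checkpoint.pt"                   # hypothetical file name

def save_checkpoint(epoch):
    # Capture everything needed to resume: weights, optimizer state, progress counter.
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()}, ckpt_path)

def load_checkpoint():
    # Resume from the last checkpoint if one exists; otherwise start from epoch 0.
    if not os.path.exists(ckpt_path):
        return 0
    ckpt = torch.load(ckpt_path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

start_epoch = load_checkpoint()
for epoch in range(start_epoch, 5):
    # ... one epoch of training would go here ...
    save_checkpoint(epoch)                    # checkpoint frequency drives the I/O cost studied above
```

Checkpoint frequency is the main knob: saving more often shortens recomputation after a failure but increases the checkpointing cost evaluated in the paper.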

Large-Scale Distributed Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch

Communications in Computer and Information Science, 2022

Adult Exposures to Toxic Trace Elements as Measured in Nails along the Interoceanic Highway in the Peruvian Amazon

International Journal of Environmental Research and Public Health

Deforestation, artisanal and small-scale gold mining (ASGM), and the rapid development related to highway expansion create opportunities for toxic trace element exposure in the Amazon region of Madre de Dios (MDD), Peru, one of the most biologically diverse places in the world. The objective of this study was to assess exposure to arsenic, cadmium, lead, and mercury among adults in Madre de Dios. In total, 418 adult (18+ years) participants in the Investigación de Migración, Ambiente, y Salud (IMAS) (Migration, Environment, and Health Study) took part in this study. Consent, survey data, and biospecimens were collected between August and November 2014. Nail elements were measured by inductively coupled plasma sector field mass spectrometry. Differences by selected individual and household characteristics and local land uses were tested using one-way ANOVAs and linear mixed models. Adults in ASGM-affected areas had higher nail arsenic and nail cadmium than their non-ASGM counterparts...

Gestión del Talento Humano y su Relación con el Desempeño Laboral de los Colaboradores en la Unidad de Gestión Educativa Local de Chachapoyas – 2018

The research, titled Human Talent Management and its Relationship with the Job Performance of Staff at the Unidad de Gestión Educativa Local of Chachapoyas – 2018, aimed to determine the relationship between human talent management and the job performance of staff at that unit. The starting hypothesis was that a direct relationship exists between human talent management and job performance at the Unidad de Gestión Educativa Local of Chachapoyas in 2018, tested by means of Pearson correlation. The methodology was empirical, framed within non-experimental research; surveys and interviews were applied, and the results showed a relationship with respect to work area, interpersonal relationships, and satisfaction with the position held, leading finally to a proposal for improving human talent management. The conclusion...

Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers

Supercomputers stand as a fundamental tool for developing our understanding of the universe. State-of-the-art scientific simulations, big data analyses, and machine learning executions require high performance computing platforms. Such infrastructures have been growing lately with the addition of thousands of newly designed components, calling their resiliency into question. It is crucial to solidify our knowledge on the way supercomputers fail. Other recent studies have highlighted the importance of characterizing failures on supercomputers. This paper aims at modelling component failures of a supercomputer based on Mixed Weibull distributions. The model is built using a real-life multi-year failure record from a leadership-class supercomputer. Using several key observations from the data, we designed an analytical model that is robust enough to represent each of the main components of supercomputers, yet it is flexible enough to alter the composition of the machine and be able to ...
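
For context on the modelling approach named here, a k-component mixed Weibull reliability function is commonly written as R(t) = sum_i w_i * exp(-(t/eta_i)^beta_i), with the weights summing to one. The snippet below simply evaluates such a mixture; the two-component weights, shapes, and scales are invented placeholders, not parameters fitted in the paper.

```python
# Evaluate a generic mixed-Weibull reliability function R(t) = sum_i w_i * exp(-(t/eta_i)**beta_i).
# The parameter values below are illustrative placeholders, not fitted values from the paper.
import numpy as np

def mixed_weibull_reliability(t, weights, shapes, scales):
    t = np.asarray(t, dtype=float)
    r = np.zeros_like(t)
    for w, beta, eta in zip(weights, shapes, scales):
        r += w * np.exp(-(t / eta) ** beta)
    return r

# Hypothetical two-component mixture (e.g., one component class per failure mode).
weights = [0.6, 0.4]          # must sum to 1
shapes  = [0.8, 1.5]          # beta < 1: infant mortality; beta > 1: wear-out
scales  = [100.0, 400.0]      # characteristic life in hours

hours = np.linspace(0, 500, 6)
print(mixed_weibull_reliability(hours, weights, shapes, scales))
```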

Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration

2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021

The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for marked advances and discoveries that leverage synergies across scientific domains. Recently, deep learning (DL) models have been successfully applied to a wide spectrum of fields, from social network analysis to climate modeling. Such advances greatly benefit from already available HPC infrastructure, mainly GPU-enabled supercomputers. However, those powerful computing systems are exposed to failures, particularly silent data corruption (SDC), in which bit-flips occur without the program crashing. Consequently, exploring the impact of SDCs on DL models is vital for maintaining progress in many scientific domains. This paper uses a distinctive methodology to inject faults into the training phases of DL models. We use checkpoint file alteration to study the effect of bit-flips in different places of a model and at different moments of the training. Our strategy is general enough to allow the analysis of any combination of DL model and framework, as long as they produce a Hierarchical Data Format 5 (HDF5) checkpoint file. The experimental results confirm that popular DL models are often able to absorb dozens of bit-flips with a minimal impact on accuracy convergence.
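
To make the checkpoint-alteration idea concrete, the sketch below flips a single bit of one float32 weight inside an HDF5 checkpoint using h5py. The file name, dataset path, element index, and bit position are all hypothetical, and this is only a sketch of the general technique, not the fault-injection tool used in the paper.

```python
# Illustrative bit-flip injection into an HDF5 checkpoint (not the paper's tool).
# Assumes a checkpoint "model.h5" containing a float32 weight dataset at a hypothetical path.
import numpy as np
import h5py

ckpt_file = "model.h5"                          # hypothetical checkpoint file
dataset_path = "dense/kernel"                   # hypothetical weight tensor inside the file
element_index = 0                               # which weight to corrupt
bit_position = 30                               # 0 = least-significant bit of the float32 pattern

with h5py.File(ckpt_file, "r+") as f:
    weights = f[dataset_path][...]              # load the tensor into memory as a NumPy array
    flat = weights.reshape(-1)
    bits = flat.view(np.uint32)                 # reinterpret the float32 bit patterns as integers
    bits[element_index] ^= np.uint32(1 << bit_position)   # flip the chosen bit in place
    f[dataset_path][...] = flat.reshape(weights.shape)    # write the corrupted tensor back
```

Training can then be resumed from the altered checkpoint to observe whether accuracy convergence degrades, which is the kind of experiment the abstract describes.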

Analyzing a Five-Year Failure Record of a Leadership-Class Supercomputer

2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2019

Extreme-scale computing systems are required to solve some of the grand challenges in science and technology. From astrophysics to molecular biology, supercomputers are an essential tool for accelerating scientific discovery. However, large computing systems are prone to failures due to their complexity. It is crucial to develop an understanding of how these systems fail in order to design reliable supercomputing platforms for the future. This paper examines a five-year failure and workload record of a leadership-class supercomputer. To the best of our knowledge, five years represents the vast majority of the lifespan of a supercomputer. This is the first time such an analysis has been performed on a top-10 modern supercomputer. We performed a failure categorization and found that: i) most errors are GPU-related, with roughly 37% of them being double-bit errors on the cards; ii) failures are not evenly spread across the physical machine, with room temperature presumably playing a major role; and iii) software errors of the system bring down several nodes concurrently. Our failure-rate analysis reveals that: i) the system consistently degrades, being at least twice as reliable at the beginning of the period as at the end; ii) a Weibull distribution closely fits the mean-time-between-failures data; and iii) hardware and software errors show markedly different patterns. Finally, we correlated failure and workload records to reveal that: i) failure and workload records are weakly correlated, except for certain types of failures when segmented by hour of the day; ii) several categories of failures make jobs crash within the first minutes of execution; and iii) a significant fraction of failed jobs exhaust the requested time regardless of when the failure occurred during execution.
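
As a sketch of the kind of failure-rate analysis described above (using synthetic timestamps rather than the system's actual log), the snippet below fits a Weibull distribution to time-between-failure data with SciPy.

```python
# Fit a Weibull distribution to time-between-failure data.
# Illustrative only: the timestamps are synthetic, not the supercomputer's failure log.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic failure timestamps in hours; a real analysis would parse the failure records instead.
failure_times = np.sort(rng.uniform(0, 10_000, size=200))
tbf = np.diff(failure_times)                    # time between consecutive failures

# Fix the location parameter at 0 so only shape (beta) and scale (eta) are estimated.
shape, loc, scale = stats.weibull_min.fit(tbf, floc=0)
print(f"Weibull shape (beta) = {shape:.2f}, scale (eta) = {scale:.2f} hours")
print(f"Mean time between failures ~ {stats.weibull_min.mean(shape, loc=loc, scale=scale):.1f} hours")
```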

Evaluando la Resiliencia de Modelos de Deep Learning

Revista Tecnología en Marcha, 2020

Deep learning models have become a valuable tool for solving complex problems in many critical areas. It is important to provide reliability in the outputs of these models, even when failures occur during execution. In this article we present a reliability evaluation of three deep learning models. We used an ImageNet dataset and developed a fault injector to carry out the tests. The results show that the models differ in their sensitivity to faults; moreover, some models keep error values low despite an increase in the fault rate.
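
A minimal sketch of this style of fault injection, assuming a PyTorch model: flip one random bit in a small fraction of the float32 weights and then rerun the usual evaluation. The fault_rate value and the helper itself are illustrative assumptions, not the injector developed for the article.

```python
# Illustrative weight-level fault injector for a PyTorch model (not the article's injector).
import torch

def inject_bit_flips(model, fault_rate=1e-5, seed=0):
    """Flip one random bit in roughly `fault_rate` of the model's float32 weights, in place."""
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for param in model.parameters():
            if param.dtype != torch.float32:
                continue                                   # only corrupt float32 weights
            flat = param.data.view(-1)
            n_faults = int(fault_rate * flat.numel())
            if n_faults == 0:
                continue
            idx = torch.randint(0, flat.numel(), (n_faults,), generator=gen)
            bits = flat[idx].view(torch.int32)             # reinterpret float32 bit patterns
            positions = torch.randint(0, 32, (n_faults,), generator=gen, dtype=torch.int32)
            flipped = bits ^ (torch.ones_like(bits) << positions)   # flip one random bit each
            flat[idx] = flipped.view(torch.float32)
    return model

# Usage: inject_bit_flips(my_model, fault_rate=1e-5), then run the normal evaluation loop
# and compare error rates against the fault-free baseline.
```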
