AutoDiagn: An Automated Real-time Diagnosis Framework for Big Data Systems

SmartMonit: Real-time Big Data Monitoring System

2019

Modern big data processing systems are becoming very complex in terms of scale, concurrency, and multi-tenancy. As a result, many failures and performance degradations occur only at run-time and are very difficult to capture. Moreover, some issues are triggered only when particular components are executed. To analyze the root causes of such issues, the dependencies of each component must be captured in real time. In this paper, we propose SmartMonit, a real-time big data monitoring system that collects infrastructure information such as the process status of each task. We also develop a real-time stream processing framework to analyze the coordination among tasks and infrastructure. This coordination information is essential for troubleshooting failures and performance degradation, especially those propagated from other causes.

Ganesha: Black-Box Fault Diagnosis for MapReduce Systems (CMU-PDL-08-112)

2018

Ganesha aims to diagnose faults transparently in MapReduce systems, by analyzing OS-level metrics alone. Ganesha's approach is based on peer-symmetry under fault-free conditions, and can diagnose faults that manifest asymmetrically at nodes within a MapReduce system. While our training is performed on smaller Hadoop clusters and for specific workloads, our approach allows us to diagnose faults in larger Hadoop clusters and for unencountered workloads. We also candidly highlight faults that escape Ganesha's black-box diagnosis.
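The peer-symmetry idea can be illustrated with a small sketch. This is not Ganesha's actual algorithm, which the abstract does not specify; it is a simplified stand-in that assumes each node reports a single OS-level metric and flags nodes that deviate strongly from the peer median. The function name `flag_asymmetric_nodes`, the `threshold` parameter, and the sample values are all hypothetical.

```python
from statistics import median

def flag_asymmetric_nodes(metrics, threshold=3.0):
    """Return the nodes whose metric deviates strongly from their peers.

    metrics:   dict mapping node name -> one OS-level metric value
    threshold: hypothetical cutoff, in units of median absolute deviation
    """
    values = list(metrics.values())
    center = median(values)
    # Median absolute deviation: a robust estimate of spread among peers.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # Most peers agree exactly, so any differing node is asymmetric.
        return sorted(n for n, v in metrics.items() if v != center)
    return sorted(
        n for n, v in metrics.items() if abs(v - center) / mad > threshold
    )

# Illustrative (made-up) CPU user-time percentages for five peer nodes.
cpu_user_pct = {
    "node1": 41.0, "node2": 39.5, "node3": 40.2, "node4": 92.7, "node5": 40.8,
}
suspects = flag_asymmetric_nodes(cpu_user_pct)
```

With these sample values only `node4` is flagged. A comparison of this kind can only catch faults that make one node look different from its peers, which matches the abstract's caveat about faults that manifest asymmetrically.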

Performance Diagnosis Using Bigtable Monitoring for Cloud Computing System

International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 2014

Replicated services are hosted on individual physical nodes in the cloud and can be composed into various types of services that serve a large number of user requests. Existing diagnosis techniques for such distributed systems cannot effectively solve the resulting performance problems, and a tool for diagnosing fine-grained performance anomalies is still lacking. Diagnosis in production cloud systems should also be completely unaided. CloudDiag systematically collects point-to-point tracing data from each physical node in the cloud. It then employs a customized MapReduce algorithm to proactively analyze the tracing data: it assembles the tracing data of each user request and classifies the requests into categories according to their call trees.
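The assemble-then-classify step can be sketched as follows. This is a minimal single-process illustration of the map/reduce pattern, not CloudDiag's implementation. The span fields (`request_id`, `parent`, `method`) and the choice of a sorted edge list as the call-tree signature are assumptions made for the example.

```python
from collections import defaultdict

def assemble_requests(spans):
    """Map/shuffle step: group tracing records by the request they belong to."""
    by_request = defaultdict(list)
    for span in spans:
        by_request[span["request_id"]].append(span)
    return by_request

def call_tree_signature(request_spans):
    """Canonical form of a call tree: its sorted (parent, method) edges."""
    return tuple(sorted((s["parent"] or "", s["method"]) for s in request_spans))

def classify_by_call_tree(spans):
    """Reduce step: put requests with the same call tree in one category."""
    categories = defaultdict(list)
    for request_id, request_spans in assemble_requests(spans).items():
        categories[call_tree_signature(request_spans)].append(request_id)
    return dict(categories)

# Hypothetical tracing records; "parent" is the calling method, None at the root.
spans = [
    {"request_id": "r1", "parent": None, "method": "frontend.handle"},
    {"request_id": "r1", "parent": "frontend.handle", "method": "db.query"},
    {"request_id": "r2", "parent": None, "method": "frontend.handle"},
    {"request_id": "r2", "parent": "frontend.handle", "method": "cache.get"},
    {"request_id": "r3", "parent": "frontend.handle", "method": "db.query"},
    {"request_id": "r3", "parent": None, "method": "frontend.handle"},
]
categories = classify_by_call_tree(spans)
```

Here `r1` and `r3` fall into one category and `r2` into another, regardless of the order in which their records arrive. Grouping by call tree means requests that did the same work are compared with each other, so a slow request stands out against its own category.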