Pedro Martins Dusso | TU Kaiserslautern
Papers by Pedro Martins Dusso
[...] and evaluates an alternative sorting component for Hadoop based on the replacement-selection algorithm. Hadoop is an open-source implementation of the MapReduce framework. MapReduce's popularity arises from the fact that it provides distribution transparency, linear scalability, and fault tolerance. This work proposes an alternative to the existing load-sort-store solution that can generate a small number of longer runs, resulting in a faster merge phase. The replacement-selection algorithm usually produces runs that are larger than the available memory, which in turn reduces the overall sorting time. This thesis first describes different algorithms for in-memory and external sorting; secondly, it describes Hadoop, its map tasks, and its output buffer strategies; thirdly, it describes four alternative sorting implementations, presented from the simplest to the most complex. Furthermore, this work analyzes the performance of these four alternatives in terms of execution time for run generation and merging, and the number of intermediate files produced. Finally, we provide a critical analysis with respect to Hadoop's default implementation.
[...] my very great appreciation to Professor Theo Härder for the continuous support since my first days in Kaiserslautern. I always had everything that was necessary to study and work: a great environment filled with marvelous and intelligent people. Professor [...]
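The run-generation idea behind replacement selection, as summarized in the abstract above, can be illustrated with a minimal sketch. This is not the thesis's Hadoop implementation: the function name `replacement_selection`, the `memory_size` parameter, and the use of plain integer keys are assumptions made purely for illustration.

```python
import heapq
from typing import Iterable, Iterator, List

_EXHAUSTED = object()  # sentinel marking the end of the input


def replacement_selection(records: Iterable[int], memory_size: int) -> Iterator[List[int]]:
    """Yield sorted runs while holding at most memory_size records in memory."""
    if memory_size < 1:
        raise ValueError("memory_size must be at least 1")
    source = iter(records)

    # Heap entries are (run_number, key). Entries tagged with the next run
    # sort after every entry of the current run, so they wait in memory
    # without being emitted too early.
    heap = []
    for key in source:
        heap.append((0, key))
        if len(heap) == memory_size:
            break
    heapq.heapify(heap)

    current_run = 0
    run: List[int] = []
    while heap:
        run_number, key = heapq.heappop(heap)
        if run_number != current_run:
            # Only next-run records remain: close the current run.
            yield run
            run = []
            current_run = run_number
        run.append(key)

        incoming = next(source, _EXHAUSTED)
        if incoming is not _EXHAUSTED:
            # A record smaller than the key just written cannot extend this
            # run without breaking its order, so it is deferred to the next.
            target = current_run if incoming >= key else current_run + 1
            heapq.heappush(heap, (target, incoming))
    if run:
        yield run


if __name__ == "__main__":
    for r in replacement_selection([5, 1, 9, 3, 7, 2, 8, 4, 6], memory_size=3):
        print(r)
```

With room for three records, this input yields two runs, the first holding six records; a load-sort-store pass with the same memory would cut the nine records into three runs of three, which leaves more files for the merge phase.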
WattDB is a locally distributed database system that runs on a cluster of lightweight nodes. It aims to keep power consumption proportional to the system's load by dynamically powering individual nodes up and down. Monitoring can serve as a basic building block for enabling adaptive and autonomic techniques, and therefore takes a special place in the cluster dynamics. This work provides a framework for monitoring and storing system usage counters and events of interest within the database. Monitored information is mapped, collected, and consolidated to be saved in a historical database, which can be used to predict future query workloads in the cluster. The framework aims to be accurate and to provide relevant and timely data while incurring low overhead on the nodes and on the network.
This work aims to investigate the state of the art of operator models in data management systems. Nowadays, data volume is scaling faster than computing resources, leading us to build bigger and more complex computational clusters. The tools we have to analyze data flows provide only basic operators for simple, SQL-like analysis, most of them inherited from the RDBMS era. Big Data analytics requires more complex tasks, which today are embedded in user-defined functions hidden from the query compiler and optimizer. Reliance on user-driven program optimizations is likely to lead to poor cluster utilization, and system-driven holistic optimization will require not just database query optimization but also optimization of the whole data flow of these applications. In this direction, extensible operator models permit programmers to add new (possibly sophisticated) functionality to data analysis tools. The query compiler can access the semantics of these new operators, potentially optimizing the data flow, since application-specific functions are treated as first-class operators.
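To make concrete what "accessing the semantics" of a user-defined operator might mean, the following is a minimal sketch and not a model taken from any surveyed system. It assumes each operator declares the fields it reads and writes, that map operators are record-at-a-time, side-effect free, and preserve the fields they do not write, and it uses hypothetical names (`Operator`, `push_filters_down`, `run`).

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Operator:
    name: str
    kind: str                      # "map" or "filter"
    fn: Callable                   # the user-defined function itself
    reads: FrozenSet[str]          # fields the function inspects
    writes: FrozenSet[str] = frozenset()  # fields the function creates or changes


def push_filters_down(plan: List[Operator]) -> List[Operator]:
    """Move a filter ahead of a preceding map when the map writes none of the fields the filter reads."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, op = plan[i - 1], plan[i]
            independent = not (op.reads & prev.writes)
            if op.kind == "filter" and prev.kind == "map" and independent:
                plan[i - 1], plan[i] = op, prev
                changed = True
    return plan


def run(plan: List[Operator], records: List[dict]) -> List[dict]:
    """Execute the plan sequentially on an in-memory list of records."""
    for op in plan:
        if op.kind == "filter":
            records = [r for r in records if op.fn(r)]
        else:
            records = [op.fn(r) for r in records]
    return records


# Hypothetical operators used only to exercise the sketch.
add_length = Operator("add_length", "map",
                      lambda r: {**r, "length": len(r["text"])},
                      frozenset({"text"}), frozenset({"length"}))
only_en = Operator("only_en", "filter",
                   lambda r: r["lang"] == "en", frozenset({"lang"}))
long_only = Operator("long_only", "filter",
                     lambda r: r["length"] > 3, frozenset({"length"}))

if __name__ == "__main__":
    optimized = push_filters_down([add_length, only_en, long_only])
    print([op.name for op in optimized])
```

Because `only_en` reads only `lang`, the sketch can run it before `add_length` and so apply the map to fewer records, whereas `long_only` must stay after the map that produces `length`. With an opaque user-defined function and no declared read and write sets, neither decision could be made safely.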