Analyzing Cost Parameters Affecting Map Reduce Application Performance

A Survey Work on Optimization Techniques Utilizing Map Reduce Framework in Hadoop Cluster

International Journal of Intelligent Systems and Applications, 2017

Data is one of the most important and vital aspects of many activities in today's world, and a vast amount of data is therefore generated every second. The rapid recent growth of data across domains requires an intelligent data analysis tool that can meet the need to analyze huge amounts of data. The Map Reduce framework is designed to process large amounts of data and to support effective decision making. It consists of two important tasks, map and reduce. Optimization is the act of achieving the best possible result under given circumstances. The goal of Map Reduce optimization is to minimize execution time and maximize system performance. This survey paper compares different optimization techniques used in the Map Reduce framework and in big data analytics. Various sources of big data generation are summarized based on various applications of big data. The wide range of application domains for big data analytics results from its characteristics: volume, velocity, variety, veracity and value. These characteristics arise from the inclusion of structured, semi-structured and unstructured data, which requires a new set of tools such as NoSQL, MapReduce and Hadoop. Although the survey provides insight into the fundamentals of big data analytics, its aim is an analysis of the various optimization techniques used in the Map Reduce framework and big data analytics.

Towards Performance Optimization for Hadoop MapReduce Applications

2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2020

Apache Hadoop is a widely used open-source distributed platform for big data processing and provides a YARN-based distributed parallel processing framework on low-cost commodity machines. However, YARN adopts static resource management based on pre-configured default resource units called containers (that is, the number of containers available per node and the size of each container are static), which leads to poor performance across the varied kinds of MapReduce applications. In addition, during the last wave of a job, many available resources frequently sit idle because YARN does not consider the wave behavior of tasks in MapReduce applications. To take advantage of these idle resources and improve performance, an important parameter, the number of map tasks, needs to be optimized based on the available resources; this number is governed by the split size. Therefore, this parameter is optimized through split size tuning based on the available resources. To address ...
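The relationship between split size, map task count, and waves described in this abstract can be illustrated with a small Python sketch. It is not the paper's method (the abstract is truncated before the method is stated); it only assumes that one map task is created per split, that every container runs one task at a time, and that a final wave is "full" when the task count is a multiple of the container count. The function names and the example figures are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """Assumption: one map task per input split."""
    return ceil_div(input_bytes, split_bytes)


def num_waves(tasks: int, containers: int) -> int:
    """Number of rounds needed when each container runs one task at a time."""
    return ceil_div(tasks, containers)


def split_size_for_full_waves(input_bytes: int, containers: int,
                              default_split_bytes: int) -> int:
    """Pick a split size so the map task count is a multiple of the
    container count, keeping the wave count implied by the default split."""
    tasks = num_map_tasks(input_bytes, default_split_bytes)
    waves = num_waves(tasks, containers)
    target_tasks = waves * containers
    return ceil_div(input_bytes, target_tasks)


if __name__ == "__main__":
    GIB, MIB = 1024 ** 3, 1024 ** 2
    size = split_size_for_full_waves(10 * GIB, containers=32,
                                     default_split_bytes=128 * MIB)
    print(size, num_map_tasks(10 * GIB, size))
```

With a 10 GiB input, 128 MiB splits, and 32 containers, the default configuration yields 80 map tasks in 3 waves, the last of which uses only 16 containers. Shrinking the split slightly gives 96 tasks, so all three waves are full. In Hadoop the effective split size is usually influenced through properties such as `mapreduce.input.fileinputformat.split.maxsize`; how the paper itself tunes it is not stated in the excerpt.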

Implementation of Big-Data Applications Using Map Reduce Framework

International Journal of Engineering and Computer Science, 2020

As a result of the rapid development of cloud computing, it is fundamental to investigate the performance of various Hadoop MapReduce applications and to identify the performance bottlenecks in a cloud cluster that contribute to higher or lower performance. It is also important to analyze the underlying hardware in cloud cluster servers so that software and hardware can be optimized to achieve the highest performance possible. Hadoop is based on MapReduce, which is among the most popular programming models for big data analysis in a parallel computing environment. In this paper, we present a detailed performance analysis, characterization, and evaluation of the Hadoop MapReduce Word Count application. The main aim of this paper is to demonstrate Hadoop MapReduce programming by giving hands-on experience in developing Hadoop-based Word Count and Apriori applications. The word count problem is implemented using the Hadoop Map Reduce framework. The Apriori Algorithm has been used ...

MAP-REDUCE IMPLEMENTATIONS: SURVEY AND PERFORMANCE COMPARISON

Map Reduce has gained remarkable significance as a prominent parallel data processing tool in the research community, academia and industry, driven by the surge in the volume of data to be analyzed. Map Reduce is used in applications such as data mining and data analytics, where massive data analysis is required, yet it is still being explored with respect to parameters such as performance and efficiency. This survey explores large-scale data processing using MapReduce and its various implementations to help the database, research and other communities develop a technical understanding of the MapReduce framework. Different MapReduce implementations are examined and their inherent features are compared on different parameters. The survey also addresses the open issues and challenges raised by building a fully functional DBMS/data warehouse on MapReduce. The various Map Reduce implementations are compared with the most popular implementation, Hadoop, and with similar implementations on other platforms.

Hadoop framework implementation and performance analysis on a cloud

The Hadoop framework uses the MapReduce programming paradigm to process big data by distributing data across a cluster and aggregating the results. MapReduce is one of the methods used to process big data hosted on large clusters. In this method, jobs are processed by dividing them into small pieces and distributing the pieces over nodes. Parameters such as the distribution method across nodes, the number of jobs run in parallel, and the number of nodes in the cluster affect the execution time of jobs. The aim of this paper is to determine how the numbers of nodes, maps, and reduces affect the performance of the Hadoop framework in a cloud environment. For this purpose, tests were carried out on a Hadoop cluster with 10 nodes hosted in a cloud environment by running the PiEstimator, Grep, Teragen, and Terasort benchmarking tools on it. Based on the tests, these benchmarking tools, which are available under the Hadoop framework, are classified as CPU-intensive or CPU-light applications. In CPU-light applications, increasing the numbers of nodes, maps, and reduces does not improve efficiency; it even increases the time spent on jobs by using system resources unnecessarily. Therefore, in CPU-light applications, keeping the numbers of nodes, maps, and reduces to a minimum is found to minimize the time spent on a job. In CPU-intensive applications, depending on the phase in which the small job pieces are processed, setting the number of maps or reduces equal to the total number of CPUs in the cluster is found to minimize the time spent on a job.

IJERT-Minimizing Time Span of Big Data Analytics using Hadoop -Map Reduce

International Journal of Engineering Research and Technology (IJERT), 2014

https://www.ijert.org/minimizing-time-span-of-big-data-analytics-using-hadoop-map-reduce https://www.ijert.org/research/minimizing-time-span-of-big-data-analytics-using-hadoop-map-reduce-IJERTV3IS052041.pdf Private and public clouds offer a new delivery model with virtually unlimited computing and storage resources. An increasing number of companies are exploiting the MapReduce paradigm and its open-source implementation Hadoop as a platform of choice for efficient Big Data processing and advanced analytics over unstructured information. This new style of large data processing enables businesses to extract information and discover novel data insights in a nontraditional and game-changing way. For many companies, the core business depends on timely analysis and processing of large quantities of new data. Data analysis applications may differ in complexity, resource needs, and data delivery deadlines. This diversity creates competing requirements for program design, job scheduling, and workload management policies in MapReduce environments. In this paper, word count is implemented using Hadoop MapReduce. A map function is specified that counts words on the distributed nodes and produces intermediate key/value pairs, together with a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. We have also conducted a simulation-based estimation. The simulation results show that word count using Map Reduce takes less time than word count without Map Reduce programming.
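The map and reduce functions described in this abstract can be sketched in plain Python. This is a single-process illustration of the programming model only, not the paper's Hadoop implementation: the names `map_words`, `shuffle`, `reduce_counts` and `word_count` are hypothetical, and the in-memory `shuffle` step stands in for the grouping by key that the framework performs between the two phases on a real cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key (done by the framework in Hadoop)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: merge all intermediate values that share the same key."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over the input lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog the end"]))
```

Because each call to `map_words` depends only on its own line and each call to `reduce_counts` only on its own key, both phases can be spread across machines without coordination, which is what allows the framework to parallelize such programs automatically.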

DETERMINATION OF RESOURCE USAGE CHARACTERISTICS FOR HADOOP MAP REDUCE TASKS

Hadoop is a common framework used to process large amounts of data. It uses the map reduce framework to divide the data and process it in parallel on multiple nodes. Different jobs have different CPU and IO resource usage, and similarly different nodes have different loads. If the resource usage of jobs and the resource availability of nodes are considered when scheduling the map and reduce tasks of different jobs, an optimized execution time can be obtained. This is more useful in a cloud environment, where map/reduce tasks execute on virtual machines instead of physical machines. As part of research conducted to build a dynamic scheduler for map reduce applications that considers job and VM characteristics, this paper proposes a technique to study job characteristics in terms of CPU and IO usage.

Efficient Processing of Job by Enhancing Hadoop Mapreduce Framework

International Journal of Advanced Research in Computer Science

Cloud Computing uses the Hadoop framework for processing Big Data in parallel. The Hadoop Map Reduce programming paradigm is one of the popular approaches in the context of Big Data; it abstracts the characteristics of parallel and distributed computing and serves as a solution for Big Data processing. Improving the performance of Map Reduce is a major concern because it affects energy efficiency, and improving the energy efficiency of Map Reduce will have a significant impact on energy savings for data centers. Many parameters influence the performance of Map Reduce; scheduling, resource allocation and data flow in particular have a significant impact. Hadoop has certain limitations that could be addressed to execute jobs more efficiently. Efficient resource allocation remains a challenge in Cloud Computing MapReduce platforms. We propose a methodology based on an enhanced Hadoop architecture that reduces the computation cost associated with Big Data analysis.