Parallel Processing of Public Open Data with the MapReduce Paradigm: A Case Study (original) (raw)

Parallel data processing with MapReduce

ACM SIGMOD Record, 2012

A prominent parallel data processing tool MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce its optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce.

A Study on MapReduce: Challenges and Trends

Indonesian Journal of Electrical Engineering and Computer Science, 2016

Nowadays we all are surrounded by Big data. The term ‘Big Data’ itself indicates huge volume, high velocity, variety and veracity i.e. uncertainty of data which gave rise to new difficulties and challenges. Big data generated may be structured data, Semi Structured data or unstructured data. For existing database and systems lot of difficulties are there to process, analyze, store and manage such a Big Data. The Big Data challenges are Protection, Curation, Capture, Analysis, Searching, Visualization, Storage, Transfer and sharing. Map Reduce is a framework using which we can write applications to process huge amount of data, in parallel, on large clusters of commodity hardware in a reliable manner. Lot of efforts have been put by different researchers to make it simple, easy, effective and efficient. In our survey paper we emphasized on the working of Map Reduce, challenges, opportunities and recent trends so that researchers can think on further improvement.

An Extensive Investigate the MapReduce Technology

International Journal of Computer Sciences and Engineering (IJCSE), E-ISSN : 2347-2693, Volume-5, Issue-10, Page No. 218-225, 2017

Since, the last three or four years, the field of "big data" has appeared as the new frontier in the wide spectrum of IT-enabled innovations and favorable time allowed by the information revolution. Today, there is a raise necessity to analyses very huge datasets, that have been coined big data, and in need of uniqueness storage and processing infrastructures. MapReduce is a programming model the goal of processing big data in a parallel and distributed manner. In MapReduce, the client describes a map function that processes a key/value pair to procreate a set of intermediate value pairs & key, and a reduce function that merges all intermediate values be associated with the same intermediate key. In this paper, we aimed to demonstrate a close-up view about MapReduce. The MapReduce is a famous framework for data-intensive distributed computing of batch jobs. This is oversimplify fault tolerance, many implementations of MapReduce materialize the overall output of every map and reduce task before it can be consumed. Finally, we also discuss the comparison between RDBMS and MapReduce, and famous scheduling algorithms in this field.

MapReduce: an infrastructure review and research insights

The Journal of Supercomputing, 2019

In the current decade, doing the search on massive data to find "hidden" and valuable information within it is growing. This search can result in heavy processing on considerable data, leading to the development of solutions to process such huge information based on distributed and parallel processing. Among all the parallel programming models, one that gains a lot of popularity is MapReduce. The goal of this paper is to survey researches conducted on the MapReduce framework in the context of its open-source implementation, Hadoop, in order to summarize and report the wide topic area at the infrastructure level. We managed to do a systematic review based on the prevalent topics dealing with MapReduce in seven areas: (1) performance; (2) job/task scheduling; (3) load balancing; (4) resource provisioning; (5) fault tolerance in terms of availability and reliability; (6) security; and (7) energy efficiency. We run our study by doing a quantitative and qualitative evaluation of the research publications' trend which is published between January 1, 2014, and November 1, 2017. Since the MapReduce is a challenge-prone area for researchers who fall off to work and extend with, this work is a useful guideline for getting feedback and starting research.

A Survey on MapReduce Implementations

International Journal of Cloud Applications and Computing, 2016

A distinguished successful platform for parallel data processing MapReduce is attracting a significant momentum from both academia and industry as the volume of data to capture, transform, and analyse grows rapidly. Although MapReduce is used in many applications to analyse large scale data sets, there is still a lot of debate among scientists and researchers on its efficiency, performance, and usability to support more classes of applications. This survey presents a comprehensive review of various implementations of MapReduce framework. Initially the authors give an overview of MapReduce programming model. They then present a broad description of various technical aspects of the most successful implementations of MapReduce framework reported in the literature and discuss their main strengths and weaknesses. Finally, the authors conclude by introducing a comparison between MapReduce implementations and discuss open issues and challenges on enhancing MapReduce.

MapReduce

Communications of the ACM, 2008

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

Survey of Parallel Data Processing in Context with MapReduce

Computer Science Conference Proceedings, 2011

MapReduce is a parallel programming model and an associated implementation introduced by Google. In the programming model, a user specifies the computation by two functions, Map and Reduce. The underlying MapReduce library automatically parallelizes the computation, and handles complicated issues like data distribution, load balancing and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart, Hadoop, is aimed for parallelizing computing in large clusters of commodity machines.This paper gives an overview of MapReduce programming model and its applications. The author has described here the workflow of MapReduce process. Some important issues, like fault tolerance, are studied in more detail. Even the illustration of working of Map Reduce is given. The data locality issue in heterogeneous environments can noticeably reduce the Map Reduce performance. In this paper, the author has addressed the illustration of data across nodes in a way that each node has a balanced data processing load stored in a parallel manner. Given a data intensive application running on a Hadoop Map Reduce cluster, the auhor has exemplified how data placement is done in Hadoop architecture and the role of Map Reduce in the Hadoop Architecture. The amount of data stored in each node to achieve improved data-processing performance is explained here.

MapReduce: A Parallel Framework

2019

Map Reduce is a practical model used for processing the large scale data that is the huge volume data at a very high speed. It is parallel processing programming model helping in achieving near real time results. Designed efficiently by Google, Map Reduce use Map and Reduce function to accurately produce a large amount of data sets. The division of one large process problem into smaller tasks and issues is carried out by the Map Function. In contrast, the Reduce Function will read and combine the intermediate results to form a single final result. Thus this paper is a brief analysis of the Basics Of Map Reduce and its applications.

MapReduce and Its Applications, Challenges, and Architecture a Comprehensive Review and Directions for Future Research.pdf

Springer, 2017

Profound attention to MapReduce framework has been caught by many different areas. It is presently a practical model for data-intensive applications due to its simple interface of programming, high scalability, and ability to withstand the subjection to flaws. Also, it is capable of processing a high proportion of data in distributed computing environments (DCE). MapReduce, on numerous occasions, has proved to be applicable to a wide range of domains. However, despite the significance of the techniques, applications, and mechanisms of MapReduce, there is no comprehensive study at hand in this regard. Thus, this paper not only analyzes the MapReduce applications and implementations in general, but it also provides a discussion of the differences between varied implementations of MapReduce as well as some guidelines for planning future research.

A Mapreduce Solution for Handling Large Data Efficiently

Machines. Technologies. Materials., 2016

Big data is large volume, heterogeneous, distributed data. Big data applications where data collection has grown continuously, it is expensive to manage, capture or extract and process data using existing software tools. With increasing size of data in data warehouse it is expensive to perform data analysis. In recent years, numbers of computation and data intensive scientific data analyses are established. To perform the large scale data mining analyses so as to meet the scalability and performance requirements of big data, several efficient parallel and concurrent algorithms got applied. For data processing, Big data processing framework relay on cluster computers and parallel execution framework provided by MapReduce. MapReduce is a parallel programming model and an associated implementation for processing and generating large data sets. In this paper, we are going to work around MapReduce, use a MapReduce solution for handling large data efficiently, its advantages, disadvantages and how it can be used in integration with other technology.