M3: Stream Processing on Main-Memory MapReduce

Scalable and Low-Latency Data Processing with Stream MapReduce

2011 IEEE Third International Conference on Cloud Computing Technology and Science, 2011

We present StreamMapReduce, a data processing approach that combines ideas from the popular MapReduce paradigm and recent developments in Event Stream Processing. We adopted the simple and scalable programming model of MapReduce and added continuous, low-latency data processing capabilities previously found only in Event Stream Processing systems. This combination leads to a system that is efficient and scalable yet simple from the user's point of view. For latency-critical applications, our system allows a hundred-fold improvement in response time. Nevertheless, when throughput is considered, our system offers a tenfold per-node throughput increase in comparison to Hadoop. As a result, we show that our approach addresses classes of applications that are not supported by any other existing system and that the MapReduce paradigm is indeed suitable for scalable processing of real-time data streams.

MapReduce online

Proceedings of the …, 2010

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
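The idea of pipelining map output into a running reduce and exposing "early returns" can be illustrated with a small single-process sketch. This is not HOP's implementation or API; the function names `map_words` and `online_word_count` and the `snapshot_every` parameter are hypothetical, chosen only for this illustration, and word counting stands in for an arbitrary user-defined job.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_words(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def online_word_count(
    records: Iterable[str], snapshot_every: int = 2
) -> Iterator[Dict[str, int]]:
    """Feed map output directly into a running reduce and yield snapshots.

    Hypothetical sketch: a real system would pipeline across machines;
    here everything runs in one process to show the control flow only.
    """
    totals: Counter = Counter()
    seen = 0
    for record in records:
        for word, count in map_words(record):
            totals[word] += count  # reduce consumes map output immediately
        seen += 1
        if seen % snapshot_every == 0:
            yield dict(totals)  # early return based on the input seen so far
    if seen % snapshot_every != 0:
        # Final answer, emitted only if the last record did not
        # already trigger a snapshot.
        yield dict(totals)


snapshots = list(online_word_count(["a b", "b c", "a a"], snapshot_every=2))
```

Here `snapshots[0]` is an estimate computed from the first two records, while the last element is the complete result. A batch-style job would only ever produce the last one.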

Future of Big Data beyond Batch Processing

In recent years, big data has been generated from a variety of sources, and there is an enormous demand for storing, managing, processing, and querying it. The MapReduce framework and its open source implementation, Hadoop, have proven to be the de facto solution for processing large amounts of data in parallel, and are intrinsically designed for batch processing and high-throughput jobs. Although Hadoop has proven itself as the de facto solution for batch jobs, there is growing demand for non-batch applications such as real-time queries, interactive jobs, and big data streams. Since Hadoop is not suitable for these non-batch jobs, new solutions have been proposed to meet these challenges. In this paper, we discuss the strengths, features, and shortcomings of the standard MapReduce framework and its open source implementation Hadoop. In addition, we discuss the significant extensions of MapReduce. Further, we consider two categories of these solutions: real-time processing and stream processing of big data. For each category, we include paradigms and strengths.

Real-time stream processing for Big Data

it - Information Technology, 2016

With the rise of the Web 2.0 and the Internet of Things, it has become feasible to track all kinds of information over time, in particular fine-grained user activities and sensor data on their environment and even their biometrics. However, while efficiency remains mandatory for any application trying to cope with huge amounts of data, only part of the potential of today's Big Data repositories can be exploited using traditional batch-oriented approaches, as the value of data often decays quickly and high latency becomes unacceptable in some applications. In the last couple of years, several distributed data processing systems have emerged that deviate from the batch-oriented approach and tackle data items as they arrive, thus acknowledging the growing importance of timeliness and velocity in Big Data analytics. In this article, we give an overview of the state of the art of stream processors for low-latency Big Data analytics and conduct a qualitative comparison of the most po...

D3-MapReduce: Towards MapReduce for Distributed and Dynamic Data Sets

2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), 2015

Since its introduction in 2004 by Google, MapReduce has become the programming model of choice for processing large data sets. Although MapReduce was originally developed for use by web enterprises in large data centers, this technique has gained a lot of attention from the scientific community for its applicability in large parallel data analysis (including geographic, high energy physics, genomics, etc.). So far MapReduce has been mostly designed for batch processing of bulk data. The ambition of D3-MapReduce is to extend the MapReduce programming model and propose an efficient implementation of this model to: i) cope with distributed data sets, i.e. data sets that span multiple distributed infrastructures or are stored on a network of loosely connected devices; ii) cope with dynamic data sets, i.e. data sets that change dynamically over time or can be either incomplete or partially available. In this paper, we draw the path towards this ambitious goal. Our approach leverages Data Life Cycle as a key concept to provide MapReduce for distributed and dynamic data sets on heterogeneous and distributed infrastructures. We first report on our attempts at implementing the MapReduce programming model for Hybrid Distributed Computing Infrastructures (Hybrid DCIs). We present the architecture of the prototype based on BitDew, a middleware for large scale data management, and Active Data, a programming model for data life cycle management. Second, we outline the challenges in terms of methodology and present our approaches based on simulation and emulation on the Grid'5000 experimental testbed. We conduct performance evaluations and compare our prototype with Hadoop, the industry reference MapReduce implementation. We present our work in progress on dynamic data sets that has led us to implement an incremental MapReduce framework. Finally, we discuss our achievements and outline the challenges that remain to be addressed before obtaining a complete D3-MapReduce environment.
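One way to picture incremental MapReduce over a dynamic data set is to cache the map output of each data chunk and, when a chunk changes, re-map only that chunk and adjust the reduced totals. The sketch below is a generic illustration under that assumption, not the D3-MapReduce, BitDew, or Active Data design; the class `IncrementalWordCount` and its methods are hypothetical names, and word counting stands in for the user's job.

```python
from collections import Counter
from typing import Dict


class IncrementalWordCount:
    """Keeps per-chunk map results so only a changed chunk is re-mapped."""

    def __init__(self) -> None:
        self._partials: Dict[str, Counter] = {}  # chunk id -> cached map output
        self._totals: Counter = Counter()        # current reduced result

    def upsert_chunk(self, chunk_id: str, text: str) -> None:
        """Add a new chunk or replace an existing one."""
        new_partial = Counter(text.lower().split())  # map this chunk only
        old_partial = self._partials.get(chunk_id, Counter())
        self._totals.subtract(old_partial)  # retract the stale contribution
        self._totals.update(new_partial)    # apply the fresh contribution
        self._partials[chunk_id] = new_partial

    def result(self) -> Dict[str, int]:
        """Return current counts, hiding words whose count dropped to zero."""
        return {word: n for word, n in self._totals.items() if n > 0}


job = IncrementalWordCount()
job.upsert_chunk("c1", "x y")
job.upsert_chunk("c2", "y z")
first = job.result()
job.upsert_chunk("c1", "x x")  # chunk c1 changed; c2 is not re-mapped
second = job.result()
```

This works for word counts because the reduce function is invertible (a contribution can be subtracted again). Reducers without an inverse, such as a maximum, would need to re-reduce from the cached partials instead.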

A Survey on MapReduce Implementations

International Journal of Cloud Applications and Computing, 2016

MapReduce, a distinguished and successful platform for parallel data processing, is attracting significant momentum from both academia and industry as the volume of data to capture, transform, and analyse grows rapidly. Although MapReduce is used in many applications to analyse large scale data sets, there is still a lot of debate among scientists and researchers on its efficiency, performance, and usability to support more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. Initially the authors give an overview of the MapReduce programming model. They then present a broad description of various technical aspects of the most successful implementations of the MapReduce framework reported in the literature and discuss their main strengths and weaknesses. Finally, the authors conclude by introducing a comparison between MapReduce implementations and discuss open issues and challenges on enhancing MapReduce.

Parallel data processing with MapReduce

ACM SIGMOD Record, 2012

MapReduce, a prominent parallel data processing tool, is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce its optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce.

Beyond Hadoop: The Paradigm Shift of Data From Stationary to Streaming Data for Data Analytics

2017

The paradigm shift from static data to fast-flowing data is an important move in the industry to accommodate the growing size of data. The velocity and volume of data continue to expand, which has started to affect business and other applications of Big Data. The paper describes the paradigm shift from static data to streaming data for data analytics beyond Hadoop. It describes how the first generation of Hadoop applications was largely built for the batch-oriented paradigm. Streaming data is essentially different from traditional data handling patterns and comes with its own set of challenges and requirements. New applications such as Storm, Flume, Kafka, and other technologies are evolving to bring in an era of real-time analytics. Data is generated incessantly from thousands of sources simultaneously, and it can be of various types, such as log files, mobile and web data, transactions, etc. The sections of my paper are Introduction followed by Streaming data, Had...