An efficient approach for low latency processing in stream data

Big Data Stream Processing: Latency and Throughput

In recent years, continuously arriving big data streams have needed to be processed and responded to instantaneously. In several crucial applications, such streaming data must be investigated and evaluated in real time. One of the basic tasks of any streaming application is to process data arriving from scattered sources and generate output promptly. The key considerations for that task are latency and throughput. Dealing with stream imperfections such as late, lost, and out-of-order data has therefore become a significant research topic in big data stream processing. We have performed prediction experiments on stock market data, considering the prices of the US dollar, oil, and gold as essential dependent parameters. Since the sources of these dependent parameters are distributed, a delay in any parameter introduces different types of latency and hence lowers the throughput of the stream processing system. In this paper, we present a way to deal with latency and throughput through the use of an appropriate pipeline and watermarks in big data stream processing.
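As an illustration of the watermark idea referenced above, the following is a minimal Python sketch of event-time windowing in which a watermark trails the largest event time seen so far; the constants WINDOW and ALLOWED_LATENESS and the buffering policy are assumptions for illustration, not the paper's implementation.

    # Minimal event-time windowing sketch with a watermark (illustrative only).
    from collections import defaultdict

    WINDOW = 60            # window length in seconds (assumed)
    ALLOWED_LATENESS = 10  # how far the watermark lags the max event time (assumed)

    windows = defaultdict(list)   # window start -> buffered values
    max_event_time = 0

    def on_event(event_time, value):
        """Route an event to its event-time window; drop it if it is late."""
        global max_event_time
        watermark = max_event_time - ALLOWED_LATENESS
        if event_time < watermark:
            return None  # late data: a real system might re-emit or side-output
        max_event_time = max(max_event_time, event_time)
        start = event_time - (event_time % WINDOW)
        windows[start].append(value)
        # Emit every window the watermark has now passed.
        closed = [s for s in windows if s + WINDOW <= watermark]
        return [(s, windows.pop(s)) for s in closed]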

Real-time stream processing for Big Data

it - Information Technology, 2016

With the rise of the Web 2.0 and the Internet of Things, it has become feasible to track all kinds of information over time, in particular fine-grained user activities and sensor data on their environment and even their biometrics. However, while efficiency remains mandatory for any application trying to cope with huge amounts of data, only part of the potential of today's Big Data repositories can be exploited using traditional batch-oriented approaches, as the value of data often decays quickly and high latency becomes unacceptable in some applications. In the last couple of years, several distributed data processing systems have emerged that deviate from the batch-oriented approach and tackle data items as they arrive, thus acknowledging the growing importance of timeliness and velocity in Big Data analytics. In this article, we give an overview of the state of the art of stream processors for low-latency Big Data analytics and conduct a qualitative comparison of the most popular ones.

A New Architecture for Real Time Data Stream Processing

International Journal of Advanced Computer Science and Applications, 2017

Processing a data stream in real time is a crucial issue for several applications; however, processing a large amount of data from different sources, such as sensor networks, web traffic, social media, video streams and others, represents a huge challenge. The main problem is that big data systems are based on Hadoop technology, especially MapReduce, for processing. The latter is a highly scalable and fault-tolerant framework. It processes large amounts of data in batches and provides deep insight into older data, but it can only process a bounded set of data. MapReduce is not appropriate for real-time stream processing, where it is very important to process data the moment it arrives to obtain a fast response and good decision making. Hence the need for a new architecture that allows real-time data processing at high speed along with low latency. The major aim of the paper at hand is to give a clear survey of the different open-source technologies that exist for real-time data stream processing, including their system architectures. We shall also provide a brand-new architecture which is mainly based on previous comparisons of real-time processing, powered by machine learning and Storm technology.

Scalable and Low-Latency Data Processing with Stream MapReduce

2011 IEEE Third International Conference on Cloud Computing Technology and Science, 2011

We present StreamMapReduce, a data processing approach that combines ideas from the popular MapReduce paradigm and recent developments in Event Stream Processing. We adopted the simple and scalable programming model of MapReduce and added continuous, low-latency data processing capabilities previously found only in Event Stream Processing systems. This combination leads to a system that is efficient and scalable, but at the same time simple from the user's point of view. For latency-critical applications, our system allows a hundred-fold improvement in response time. At the same time, when throughput is considered, our system offers a tenfold per-node throughput increase in comparison to Hadoop. As a result, we show that our approach addresses classes of applications that are not supported by any other existing system and that the MapReduce paradigm is indeed suitable for scalable processing of real-time data streams.
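To make the combination concrete, here is a minimal Python sketch of a continuous map/reduce step in the spirit described above: each record is mapped to key-value pairs and folded into per-key state that is re-emitted immediately, with no batch barrier. The word-count mapper and the function names are assumptions for illustration, not the StreamMapReduce API.

    # Continuous map/reduce over a stream (illustrative sketch, not the paper's API).
    from collections import defaultdict

    def map_fn(record):
        # Hypothetical mapper: count words in a line of text.
        for word in record.split():
            yield word, 1

    state = defaultdict(int)  # continuously updated reduce state

    def on_record(record):
        updates = {}
        for key, value in map_fn(record):
            state[key] += value     # incremental reduce, applied per record
            updates[key] = state[key]
        return updates              # fresh results per record, not per job

    print(on_record("to be or not to be"))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}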

Data Stream Processing

Learning from Data Streams, 2007

The rapid growth in information science and technology in general, and the complexity and volume of data in particular, have introduced new challenges for the research community. Many sources produce data continuously. Examples include sensor networks, wireless networks, radio frequency identification (RFID), customer click streams, telephone records, multimedia data, scientific data, retail chain transactions, etc. These sources are called data streams. A data stream is an ordered sequence of instances that can be read only once, or a small number of times, using limited computing and storage capabilities. These sources of data are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions in dynamic environments.
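The read-once, bounded-memory constraint described above can be illustrated with a single-pass sketch in Python that maintains a running mean and a uniform reservoir sample in constant memory; the particular statistics chosen here are an assumption for illustration, not prescribed by the chapter.

    # One-pass processing under stream constraints: each element is read once
    # and summarized in constant memory (illustrative sketch).
    import random

    count, mean = 0, 0.0
    reservoir, K = [], 100   # uniform sample of at most K elements

    def consume(x):
        global count, mean
        count += 1
        mean += (x - mean) / count          # running mean, O(1) state
        if len(reservoir) < K:
            reservoir.append(x)
        elif random.randrange(count) < K:   # reservoir sampling (algorithm R)
            reservoir[random.randrange(K)] = x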

The 8 Requirements of Real-Time Stream Processing

Applications that require real-time processing of high-volume data streams are pushing the limits of traditional data processing infrastructures. These stream-based applications include market feed processing and electronic trading on Wall Street, network and infrastructure monitoring, fraud detection, and command and control in military environments. Furthermore, as the "sea change" caused by cheap micro-sensor technology takes hold, we expect to see everything of material significance on the planet get "sensor-tagged" and report its state or location in real time. This sensorization of the real world will lead to a "green field" of novel monitoring and control applications with high-volume and low-latency processing requirements.

Data Streams Processing Techniques

Advances in Computational Intelligence and Robotics

Many modern applications in several domains such as sensor networks, financial applications, web logs and click-streams operate on continuous, unbounded, rapid, time-varying streams of data elements. These applications present new challenges that are not addressed by traditional data management techniques. For the query processing of continuous data streams, we consider in particular continuous queries, which are evaluated continuously as data streams continue to arrive. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. One of the most critical requirements of stream processing is fast processing, so parallel and distributed processing are good solutions. This paper gives (1) an analysis of the different continuous query processing techniques; (2) a comparative study of data stream execution environments; and (3) a proposal for an integrated system for processing data streams based on cloud computing, which applies a continuous query optimization technique in a cloud environment.
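A continuous query of the kind described above can be sketched in a few lines of Python: a per-key sliding window whose answer is re-emitted on every arriving tuple, always reflecting the stream seen so far. The query ("average price per symbol over the last N events") and the field names are assumptions for illustration.

    # Continuous query sketch: per-key sliding-window average (illustrative only).
    from collections import defaultdict, deque

    N = 5
    recent = defaultdict(lambda: deque(maxlen=N))  # per-key sliding window

    def on_tuple(symbol, price):
        window = recent[symbol]
        window.append(price)
        # The continuous answer always reflects the stream seen so far.
        return symbol, sum(window) / len(window)

    print(on_tuple("ACME", 10.0))  # ('ACME', 10.0)
    print(on_tuple("ACME", 12.0))  # ('ACME', 11.0)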

Overload Management in Data Stream Processing Systems with Latency Guarantees

Stream processing systems are becoming increasingly important for analysing real-time data generated by modern applications such as online social networks. Their main characteristic is to produce a continuous stream of fresh results as new data are generated in real time. Resource provisioning of stream processing systems is difficult due to time-varying workloads that induce unknown resource demands over time.
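One common overload-management policy, load shedding against a latency budget, can be sketched as follows; the budget, cost estimate, and drop policy are assumptions for illustration, not the paper's algorithm.

    # Load shedding against a latency budget (illustrative sketch).
    import time
    from collections import deque

    LATENCY_BUDGET = 0.5   # seconds from arrival to completed processing (assumed)
    queue = deque()        # (arrival_time, item)

    def enqueue(item):
        queue.append((time.monotonic(), item))

    def drain(process, cost_estimate=0.01):
        while queue:
            arrived, item = queue.popleft()
            waited = time.monotonic() - arrived
            if waited + cost_estimate > LATENCY_BUDGET:
                continue   # shed: the result would arrive too late to be useful
            process(item)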

A review on big data real-time stream processing and its scheduling techniques

International Journal of Parallel, Emergent and Distributed Systems, 2019

Over the last decade, several interconnected disruptions have happened in the large-scale distributed and parallel computing landscape. The volume of data produced by the various activities of society has never been so big, and it is generated at an increasing speed. Data received in real time is often most valuable at the moment it arrives, and it supports timely decision making. Systems for managing data streams are not a recent concept, but they are becoming more important due to the multiplication of data stream sources in the context of the IoT. This paper addresses the unique processing challenges posed by the nature of streams and the related mechanisms used to face them in the big data era. Several cloud systems have emerged to enable distributed processing of big data streams. Distributed stream management systems (DSMSs), along with their strengths and limitations, are presented and compared. Computations in these systems demand elaborate orchestration over a collection of machines. Consequently, a classification and literature review of these systems' scheduling techniques and their enhancements is also provided.

A Software Chain Approach to Big Data Stream Processing and Analytics

2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, 2015

Big Data Stream processing is among the most important computing trends nowadays. The growing interest in Big Data Stream processing comes from the needs of many Internet-based applications that generate huge data streams, whose processing can serve to extract useful analytics and inform decision-making systems. For instance, an IoT-based monitoring system for a supply chain can provide real-time data analytics on business delivery performance. The challenges of processing Big Data Streams reside in coping with real-time processing of an unbounded stream of data; that is, the computing system should be able to compute at a throughput high enough to accommodate the rate at which the input data stream is generated. Clearly, the higher the data stream rate, the higher the throughput must be to achieve consistency of the processing results (e.g. preserving the order of events in the data stream). In this paper we show how to map the data stream processing phases (from data generation to final results) to a software chain architecture, which comprises five main components: sensor, extractor, parser, formatter and outputter. We exemplify the approach using Yahoo! S4 to process the Big Data Stream from the FlightRadar24 global flight monitoring system.
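The five-component chain named above (sensor, extractor, parser, formatter, outputter) can be sketched as composed Python callables; the flight-record format is an assumption for illustration, loosely echoing the FlightRadar24 example, and is not the paper's concrete schema.

    # Software chain sketch: sensor -> extractor -> parser -> formatter -> outputter.
    import json

    def sensor():                       # produce raw records (assumed format)
        yield '{"flight": "AB123", "alt": 35000}'

    def extractor(raw):                 # isolate the payload of interest
        return raw.strip()

    def parser(payload):                # raw text -> structured record
        return json.loads(payload)

    def formatter(record):              # structured record -> output schema
        return f"{record['flight']}: {record['alt']} ft"

    def outputter(line):                # deliver the final result
        print(line)

    for raw in sensor():
        outputter(formatter(parser(extractor(raw))))  # AB123: 35000 ft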