Design principles for developing stream processing applications

Data Stream Processing

Learning from Data Streams, 2007

The rapid growth in information science and technology in general, and in the complexity and volume of data in particular, has introduced new challenges for the research community. Many sources produce data continuously: sensor networks, wireless networks, radio frequency identification (RFID), customer click streams, telephone records, multimedia data, scientific data, sets of retail chain transactions, and so on. These sources are called data streams. A data stream is an ordered sequence of instances that can be read only once, or a small number of times, using limited computing and storage capabilities. Such sources are characterized by being open-ended, flowing at high speed, and generated by non-stationary distributions in dynamic environments.
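To make the one-pass, bounded-memory constraint concrete, here is a minimal Python sketch (not taken from the book; the `sensor_readings` generator is a stand-in for any unbounded source): each element is read exactly once, and only constant state (a count and a running mean) is retained no matter how long the stream runs.

```python
# One-pass stream processing with O(1) memory: each element is seen
# exactly once, and only (count, mean) is kept as state.
import random
from typing import Iterator

def sensor_readings() -> Iterator[float]:
    """Simulated unbounded source; replace with a real feed."""
    while True:
        yield random.gauss(20.0, 2.0)  # e.g., temperature samples

def running_mean(stream: Iterator[float], limit: int) -> float:
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, constant memory
        if count >= limit:          # stop condition only for the demo
            break
    return mean

print(running_mean(sensor_readings(), limit=10_000))
```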

An Analysis of Stream Processing Languages

2009

Stream processing languages and stream processing engines have become more popular as they emerged from several modern data-stream-intensive applications such as sensor monitoring, stock markets, and network monitoring. This study discusses the characteristics and features of stream processing technology to provide in-depth, high-level guidance on, and a comparison of, stream processing systems and their underlying languages and technology with respect to the characteristics and features used by certain applications. The overall aim of this paper is to analyze and identify the desired features of stream processing languages and to evaluate a few representative stream processing systems and languages on the basis of those desired features. The analysis could help in identifying a suitable stream processing technology for particular applications, as well as aid the design and development of such languages for new, emerging applications.

A stream processing abstraction framework

Frontiers in Big Data

Real-time analysis of large multimedia streams is nowadays made efficient by several Big Data streaming platforms, such as Apache Flink and Samza. However, using such platforms is difficult because the facilities they offer are often too raw to be effectively exploited by analysts. We describe the evolution of RAM3S, a software infrastructure for the integration of Big Data stream processing platforms, into SPAF, an abstraction framework that provides programmers with a simple but powerful API to ease the development of stream processing applications. Using SPAF, the programmer can easily implement complex real-time analyses of massive streams on top of a distributed computing infrastructure able to manage the volume and velocity of Big Data streams, thus effectively transforming data into value.
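As a purely hypothetical illustration of such an abstraction (this is not SPAF's actual API; the `StreamApp` facade and its names are invented for this sketch), a source-process-sink framework lets the analyst supply only per-element logic while the platform details stay hidden:

```python
# Hypothetical source-process-sink facade, not SPAF's real API.
from typing import Callable, Generic, Iterable, TypeVar

I = TypeVar("I")
O = TypeVar("O")

class StreamApp(Generic[I, O]):
    """Illustrative facade: wire a source, a processor, and a sink."""
    def __init__(self, source: Iterable[I],
                 process: Callable[[I], O],
                 sink: Callable[[O], None]):
        self.source, self.process, self.sink = source, process, sink

    def run(self) -> None:
        # A real framework would distribute this loop over a cluster;
        # here it runs locally just to show the programming surface.
        for element in self.source:
            self.sink(self.process(element))

# Usage: flag frames whose score exceeds a threshold.
frames = ({"id": i, "score": (i % 7) / 10} for i in range(20))
app = StreamApp(frames,
                process=lambda f: (f["id"], f["score"] > 0.5),
                sink=print)
app.run()
```

The point of such a design is that the analyst's code shrinks to the `process` and `sink` callables, while volume and velocity are handled by the underlying engine.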

Curracurrong: a stream processing system for distributed environments

2014

Advances in technology have given rise to applications that are deployed on wireless sensor networks (WSNs), the cloud, and the Internet of Things. There are many emerging applications, some of which include sensor-based monitoring, web traffic processing, and network monitoring. These applications collect large amounts of data as an unbounded sequence of events and process them to generate new sequences of events. Such applications need an adequate programming model that can process large amounts of data with minimal latency; for this purpose, stream programming, among other paradigms, is ideal. However, stream programming needs to be adapted to meet the …

Stream processing platforms for analyzing big dynamic data

it - Information Technology, 2016

Nowadays, data is produced in every aspect of our lives, leading to a massive amount of information generated every second. However, this volume is often too large to be stored, and for many applications the information contained in these data streams is only useful while it is fresh. Batch processing platforms like Hadoop MapReduce do not fit these needs, as they require collecting data on disk and processing it repeatedly. Therefore, modern data processing engines combine the scalability of distributed architectures with the one-pass semantics of traditional stream engines. In this paper, we survey the current state of the art in scalable stream processing from a user perspective. We examine and describe their architecture, execution model, programming interface, and data analysis support, and discuss the challenges and limitations of their APIs. In this connection, we introduce Piglet, an extended Pig Latin language and code generator that compiles (extended) Pig Latin code ...

A data stream language and system designed for power and extensibility

2006

By providing integrated and optimized support for user-defined aggregates (UDAs), data stream management systems (DSMSs) can achieve superior power and generality while preserving compatibility with current SQL standards. This is demonstrated by the Stream Mill system which, through its Expressive Stream Language (ESL), efficiently supports a wide range of applications, including very advanced ones such as data stream mining, streaming XML processing, time-series queries, and RFID event processing. ESL supports physical and logical windows (with optional slides and tumbles) on both built-in aggregates and UDAs, using a simple framework that applies uniformly to aggregate functions written in external procedural languages and those natively written in ESL. The constructs introduced in ESL extend the power and generality of DSMSs, and are conducive to UDA-specific optimization and efficient execution, as demonstrated by several experiments.
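To illustrate the window notions mentioned above, here is a sketch in Python rather than ESL syntax (the `windowed` helper is invented for this example, and windows are counted in tuples, i.e., physical windows). A tumbling window is the special case where the slide equals the window size; a smaller slide yields overlapping, sliding windows, and the UDA is any function over the window contents:

```python
# Windowed evaluation of a user-defined aggregate (UDA) over a stream.
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def windowed(stream: Iterable[T], size: int, slide: int,
             uda: Callable[[list[T]], R]) -> Iterator[R]:
    buf: Deque[T] = deque(maxlen=size)  # holds the current window
    since_emit = 0
    for x in stream:
        buf.append(x)
        since_emit += 1
        if len(buf) == size and since_emit >= slide:
            yield uda(list(buf))  # apply the UDA to the window contents
            since_emit = 0

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
# Tumbling window: slide == size -> non-overlapping windows.
print(list(windowed(data, size=4, slide=4, uda=max)))  # [4, 9]
# Sliding window: slide < size -> overlapping windows.
print(list(windowed(data, size=4, slide=2, uda=sum)))  # [9, 19, 22, 16]
```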

Defining the execution semantics of stream processing engines

Journal of Big Data, 2017

Several modern data-intensive applications need to process large volumes of data on the fly, as they are produced. Examples range from credit card fraud detection systems, which analyze massive streams of credit card transactions to identify suspicious patterns, to environmental monitoring applications that continuously analyze sensor data, to click stream analysis of Web sites to identify frequent patterns of interaction. More generally, stream processing is a central requirement in today's information systems. This state of affairs has pushed the development of several stream processing engines (SPEs) that continuously analyze streams of data to produce new results as new elements enter the streams. Unfortunately, existing SPEs adopt different processing models, and standardized execution semantics have not yet emerged. This severely hampers the usability …
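To see why unstandardized execution semantics matter in practice, consider this small Python sketch (an illustration, not an example from the paper): the same out-of-order input, cut into windows of two tuples, yields different answers under an arrival-order policy and an event-time-order policy, and nothing in the stream itself says which answer is the "correct" one.

```python
# Same input, two plausible execution semantics, two different results.
events = [(1, "a"), (3, "c"), (2, "b"), (4, "d")]  # (event_time, payload)

def windows(seq, size=2):
    # Cut a finite sequence into consecutive, non-overlapping windows.
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Arrival-order semantics: window over tuples as they arrive.
print(windows([v for _, v in events]))          # [['a', 'c'], ['b', 'd']]

# Event-time semantics: window over tuples ordered by timestamp.
print(windows([v for _, v in sorted(events)]))  # [['a', 'b'], ['c', 'd']]
```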

Data Streams Processing Techniques

Advances in Computational Intelligence and Robotics

Many modern applications in domains such as sensor networks, financial applications, web logs, and click-streams operate on continuous, unbounded, rapid, time-varying streams of data elements. These applications present new challenges that are not addressed by traditional data management techniques. For the query processing of continuous data streams, we consider in particular continuous queries, which are evaluated continuously as data streams continue to arrive. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. One of the most critical requirements of stream processing is fast processing, so parallel and distributed processing are natural solutions. This paper gives (1) an analysis of the different continuous query processing techniques; (2) a comparative study of data stream execution environments; and (3) a proposal for an integrated system for processing data streams based on cloud computing, which applies continuous query optimization techniques in a cloud environment.
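A minimal sketch of the continuous-query idea (illustrative, not from the paper): the query stays registered and re-emits an updated answer after each arriving element, so the answer always reflects the stream seen so far.

```python
# A continuous query: a selection predicate plus an incremental
# aggregate, producing a refined answer after every arrival.
from collections import Counter
from typing import Iterable, Iterator

def continuous_count(events: Iterable[dict]) -> Iterator[Counter]:
    counts: Counter = Counter()
    for e in events:
        if e["type"] == "view":     # continuous selection predicate
            counts[e["page"]] += 1  # incremental, one-pass update
            yield counts.copy()     # current answer, refined over time

events = [{"type": "view", "page": "/a"},
          {"type": "click", "page": "/a"},
          {"type": "view", "page": "/b"},
          {"type": "view", "page": "/a"}]
for answer in continuous_count(events):
    print(dict(answer))
```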

PiCo: A Novel Approach to Stream Data Analytics

Lecture Notes in Computer Science, 2018

In this paper, we present a new C++ API with a fluent interface called PiCo (Pipeline Composition). PiCo's programming model aims at making the programming of data analytics applications easier while preserving or enhancing their performance. This is attained through three key design choices: 1) unifying batch and stream data access models, 2) decoupling processing from data layout, and 3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types, in the sense that it is possible to re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.). Preliminary results show that PiCo can attain better performance in terms of execution times and hugely improve memory utilization when compared to Spark and Flink in both batch and stream processing.
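To convey the polymorphism idea in Python rather than PiCo's C++ fluent interface (`pipeline`, `map_op`, and `filter_op` are invented names for this sketch), the same composed pipeline can run unchanged over a finite list (batch) and over a generator (stream), because both are consumed through the same iterator protocol:

```python
# Operators polymorphic in the data model: one pipeline, many inputs.
from typing import Callable, Iterable, Iterator

def pipeline(*stages: Callable[[Iterable], Iterator]):
    def run(data: Iterable) -> Iterator:
        for stage in stages:
            data = stage(data)  # chain lazily, stage by stage
        return data
    return run

def map_op(f):    return lambda xs: (f(x) for x in xs)
def filter_op(p): return lambda xs: (x for x in xs if p(x))

square_evens = pipeline(filter_op(lambda x: x % 2 == 0),
                        map_op(lambda x: x * x))

print(list(square_evens([1, 2, 3, 4])))    # batch input:  [4, 16]
print(list(square_evens(iter(range(5)))))  # stream input: [0, 4, 16]
```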

Issues in data stream management

2003

Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for on-line analysis of rapidly changing data streams.