Identifying the potential of Near Data Computing for Apache Spark (original) (raw)
Related papers
Identifying the potential of Near Data Processing for Apache Spark
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analyt-ics for being a unified framework for both, batch and stream data processing. There is also a renewed interest in Near Data Processing (NDP) due to technological advancement in the last decade. However, it is not known if NDP archi-tectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case of NDP architecture comprising programmable logic based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, by extensive profiling of Apache Spark based workloads on Ivy Bridge Server.
Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data an-alytics for being a unified framework for both, batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We compare micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that while batch processing workloads are bounded on the latency of frequent data accesses to DRAM, stream processing workloads are curbed by L1 instruction cache misses. For data accesses we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance with up to 12%, (ii) disabling next-line L1-D prefetchers can reduce the execution time by up-to 15% and (iii) multiple small executors can provide up-to 36% speedup over single large executor.
Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analyt-ics. Recent studies propose scale-in clusters with in-storage processing devices to process big data analytics with Spark However the proposal is based solely on the memory band-width characterization of in-memory data analytics and also does not shed light on the specification of host CPU and memory. Through empirical evaluation of in-memory data analytics with Apache Spark on an Ivy Bridge dual socket server, we have found that (i) simultaneous multi-threading is effective up to 6 cores (ii) data locality on NUMA nodes can improve the performance by 10% on average, (iii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%, (iv) DDR3 operating at 1333 MT/s is sufficient and (v) multiple small executors can provide up to 36% speedup over single large executor.
Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server
The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) for exhibiting superior scale-out performance on the commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers. Through empirical evaluation of representative benchmark workloads on a dual socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread level load imbalance and work-time inflation (the additional CPU time spent by threads in a multi-threaded computation beyond the CPU time required to perform the same work in a sequential computation). We have also found that workloads are bound by the latency of frequent data accesses to the memory. By enlarging input data size, application performance degrades significantly due to the substantial increase in wait time during I/O operations and garbage collection, despite 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average, (ii) disabling next-line L1-D prefetchers can reduce the execution time by upto 14%. For garbage collection impact, we match memory behavior with the garbage collector to improve the performance of applications between 1.6x to 3x and recommend using multiple small Spark executors that can provide up to 36% reduction in execution time over single large executor. Based on the characteristics of workloads, the thesis envisions near-memory and near storage hardware acceleration to improve the single-node performance of scale-out frameworks like Apache Spark. Using modeling techniques, it estimates the speed-up of 4x for Apache Spark on scale-up servers augmented with near-data accelerators.
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server
In last decade, data analytics have rapidly pro- gressed from traditional disk-based processing to modern in- memory processing. However, little effort has been devoted in performance enhancement at micro-architecture level. This paper characterizes the performance of in-memory data analytic applications using Apache Spark framework. It uses a single node NUMA machine and identifies the bottlenecks hampering the scalability of the workloads. In doing so, it quantifies the inefficiencies at micro architectural level for various big data applications. Through empirical evaluation, we show that spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread level load imbalance. Further, at the micro-architecture level, we observe that memory bound latency is one of the major cause of work time inflation
Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads
—While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both, batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to only batch processing workloads. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that batch processing and stream processing has same micro-architectural behavior in Spark if the difference between two implementations is of micro-batching only. If the input data rates are small, stream processing workloads are front-end bound. However, the front end bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
Sheer increase in volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on the commodity machines, the impact of data volume on the performance of Spark based data analytics in scale-up configuration is not well understood. We present a deep-dive analysis of Spark based applications on a large scale-up server machine. Our analysis reveals that Spark based data analytics are DRAM bound and do not benefit by using more than 12 cores for an executor. By enlarging input data size, application performance degrades significantly due to substantial increase in wait time during I/O operations and garbage collection, despite 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behavior with the garbage collector to improve performance of applications between 1.6x to 3x.
Memory Management Approaches in Apache Spark: A Review
2020
In the era of Big Data, processing large amounts of data through data-intensive applications, is presenting a challenge. An in-memory distributed computing system; Apache Spark is often used to speed up big data applications. It caches intermediate data into memory, so there is no need to repeat the computation or reload data from disk when reusing these data later. This mechanism of caching data in memory makes Apache Spark much faster than other systems. When the memory used for caching data is full, the cache replacement policy used by Apache Spark is the Least Recently Used (LRU), however LRU algorithm performs poorly in some workloads. This review is going to give an insight about different replacement algorithms used to address the LRU problems, categorize the different selection factors and provide a comparison between the algorithms in terms of selection factors, performance and the benchmarks used in the research.