A cache filtering optimisation for queries to massive datasets on tertiary storage

DICE: An Effective Query Result Cache for Distributed Storage Systems

2010

Due to the proliferation of the Internet and intranets, distributed storage systems have received a great deal of attention. These systems span a large number of machines and store huge amounts of data for many users. In such systems, a row can be accessed directly using its row key. We concentrate on the problem of efficiently processing queries whose predicate is on a column other than the row key. In this paper, we present a cache management technique called DICE, which maintains the results of range queries in order to answer subsequent range queries. To accelerate searches over the cached query results, we use modified Interval Skip Lists. In addition, we devise a novel cache replacement policy, since DICE maintains intervals rather than individual data items. Because our replacement policy considers the properties of intervals, the proposed technique is more efficient than traditional buffer replacement algorithms. Our experimental results demonstrate the efficiency of the proposed technique.
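The interval orientation is the distinctive part: a cached entry can answer not only the exact query that created it but any sub-range it covers, and a new query may be satisfied partly from cache and partly from storage. Below is a minimal sketch of that containment lookup, using a plain sorted list of disjoint intervals as a simplified stand-in for DICE's modified Interval Skip List (which supports the same split with logarithmic search time); all names are illustrative, not the paper's API.

```python
class IntervalResultCache:
    """Cached range-query results kept as disjoint intervals, sorted by
    lower bound. A linear-scan stand-in for DICE's modified Interval
    Skip List: same containment logic, worse asymptotic search time."""

    def __init__(self):
        self.intervals = []          # [(lo, hi, rows), ...], disjoint, sorted

    def lookup(self, lo, hi):
        """Split query [lo, hi] into cached pieces and missing sub-ranges."""
        hits, missing, cursor = [], [], lo
        for ilo, ihi, rows in self.intervals:
            if ihi < cursor:         # cached interval ends before the query
                continue
            if ilo > hi:             # past the query range: stop scanning
                break
            if ilo > cursor:         # uncovered gap must go to storage
                missing.append((cursor, ilo - 1))
            hits.append(rows)        # a real system would trim rows to [lo, hi]
            cursor = ihi + 1
        if cursor <= hi:             # trailing gap after the last interval
            missing.append((cursor, hi))
        return hits, missing
```

For example, with cached intervals [10, 20] and [30, 40], `lookup(15, 35)` returns both cached result sets plus the missing sub-range (21, 29) to be fetched and then inserted as a new interval.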

Efficient Algorithms for Multi-file Caching

Lecture Notes in Computer Science, 2004

Multi-File Caching issues arise in applications where a set of jobs is processed and each job requests one or more input files. A given job can be started only if all its input files are preloaded into a disk cache. Examples of applications where Multi-File Caching may be required are scientific data mining, bit-sliced indexes, and analysis of sets of vertically partitioned files. The difference between this type of caching and traditional file caching systems is that caching and replacement decisions are made based on "combinations of files (file bundles)" rather than on single files. In this work we propose new algorithms for Multi-File Caching and analyze their performance. Extensive simulations are presented to establish the effectiveness of the Multi-File Caching algorithms in terms of job response time and job queue length.
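The all-or-nothing admission constraint is what separates this from per-file caching: a bundle is only useful once it is complete, and eviction must not break the bundle being assembled. The sketch below illustrates that constraint; the victim-selection rule here is deliberately naive (any resident file outside the bundle), whereas the paper's algorithms rank victims by cost and benefit.

```python
class BundleCache:
    """Sketch of multi-file ("file bundle") caching: a job may start only
    once *all* of its input files are resident in the disk cache."""

    def __init__(self, capacity):
        self.capacity = capacity      # cache size, in whole-file units
        self.resident = set()         # files currently in the disk cache

    def admit_job(self, bundle, fetch):
        """Preload every missing file of `bundle`, evicting only files the
        bundle does not need. Returns True when the job can start."""
        if len(bundle) > self.capacity:
            return False              # the bundle can never fit as a whole
        for f in [f for f in bundle if f not in self.resident]:
            while len(self.resident) >= self.capacity:
                # naive victim choice; real policies score candidates
                victim = next(v for v in self.resident if v not in bundle)
                self.resident.discard(victim)
            fetch(f)                  # preload from mass storage
            self.resident.add(f)
        return True
```

Because the capacity guard runs first, the inner victim search can never exhaust the cache: files already belonging to the bundle are protected, and everything else is fair game.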

On Cache Replacement Policies for Servicing Mixed Data Intensive Query Workloads

2002

When data analysis applications are employed in a multi-client environment, a data server must service multiple simultaneous queries, each of which may employ complex user-defined data structures and operations on the data. It is then necessary to harness inter- and intra-query commonalities and system resources to improve the performance of the data server. We have developed a framework and customizable middleware to enable reuse of intermediate and final results among queries, through an in-memory active semantic cache and user-defined transformation functions. Since resources such as processing power and memory space are limited on the machine hosting the server, effective scheduling of incoming queries and efficient cache replacement policies are challenging issues that must be addressed. We addressed the scheduling problem in earlier work; in this paper we describe and evaluate several cache replacement policies. We present an experimental evaluation of the policies on a shared-memory parallel system using two applications from different application domains.
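To make the reuse idea concrete: a semantic cache stores results keyed by a query descriptor, and a new query can be answered from a cached entry whenever a user-defined transformation can derive its result. The sketch below uses 2-D bounding-box descriptors and a clipping transformation as one assumed example; the actual middleware supports arbitrary user-defined descriptors and transformation functions.

```python
from dataclasses import dataclass

@dataclass
class Region:
    """Query descriptor: a 2-D bounding box over the dataset."""
    x0: int; y0: int; x1: int; y1: int

    def contains(self, o):
        return (self.x0 <= o.x0 and self.y0 <= o.y0
                and self.x1 >= o.x1 and self.y1 >= o.y1)

def clip_transform(cached_result, query):
    """User-defined transformation: clip a cached 2-D result to a sub-region."""
    return [(x, y, v) for x, y, v in cached_result
            if query.x0 <= x <= query.x1 and query.y0 <= y <= query.y1]

class SemanticCache:
    def __init__(self):
        self.entries = []                    # (Region, result) pairs

    def answer(self, query, compute):
        for region, result in self.entries:
            if region.contains(query):       # cached entry subsumes the query
                return clip_transform(result, query)
        result = compute(query)              # miss: evaluate from raw data
        self.entries.append((query, result))
        return result
```

Replacement then has to weigh not just recency but how often an entry is usable as a source for such transformations, which is what makes the policies studied here nontrivial.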

Distributed Caching for Complex Querying of Raw Arrays

arXiv, 2018

As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format, without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, the framework determines an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority of the proposed framework over existing techniques, by as much as two orders of magnitude, in terms of cache overhead and workload execution time.
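The collocation objective of the second stage can be illustrated with a small greedy heuristic: each cell goes to the node that already holds most of the cells it is co-accessed with, so dependent cells avoid crossing the network. This is only a sketch of the idea; `deps`, `size`, and the placement order below are assumed stand-ins for the paper's cost-based optimization.

```python
def place_cells(cells, nodes, deps, size):
    """Greedy collocation sketch: assign each array cell to the node where
    its already-placed co-accessed cells live, breaking ties by node load.
    deps[c] = cells co-accessed with c (from the historical query workload);
    size[c] = bytes of cell c."""
    placement = {}
    load = {n: 0 for n in nodes}
    for c in sorted(cells, key=lambda c: -size[c]):   # big cells placed first
        def transfer_cost(n):
            # bytes of already-placed dependent cells living on other nodes
            return sum(size[d] for d in deps.get(c, ())
                       if d in placement and placement[d] != n)
        best = min(nodes, key=lambda n: (transfer_cost(n), load[n]))
        placement[c] = best
        load[best] += size[c]
    return placement
```

The paper formulates this assignment as an optimization problem rather than a one-pass greedy scan, but the objective, minimizing bytes moved between dependent cells, is the same.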

Access Patterns to Disk Cache for Large Scientific Archive

2020

Large scientific projects are increasingly relying on analyses of data for their new discoveries, and a number of different data management systems have been developed to serve these scientific projects. In this work-in-progress paper, we describe an effort to understand the data access patterns of one of these data management systems, dCache. This particular deployment of dCache acts as a disk cache in front of a large tape storage system primarily containing high-energy physics data. Based on 15 months of dCache logs, the cache accesses the tape system only about once per 50-plus file requests, a hit rate above 98%, which indicates that it is effective as a disk cache. The on-disk files are used repeatedly, more than three times a day. We have also identified a number of unusual access patterns that are worth further investigation.

Performance Evaluation of Traditional Caching Policies on a Large System with Petabytes of Data

2012 IEEE Seventh International Conference on Networking, Architecture, and Storage, 2012

Caching is widely known to be an effective method for improving I/O performance by storing frequently used data on higher-speed storage components. However, most existing studies of caching performance evaluate fairly small files populating a relatively small cache. Few reports detail the performance of traditional cache replacement policies on extremely large caches. Do such traditional caching policies still work effectively when applied to systems with petabytes of data? In this paper, we comprehensively evaluate the performance of several cache policies, including First-In-First-Out (FIFO), Least Recently Used (LRU), and Least Frequently Used (LFU), on the global satellite imagery distribution application maintained by the U.S. Geological Survey (USGS) Earth Resources Observation and Science Center (EROS). The evidence suggests that traditional caching policies can provide performance gains on large data sets just as they do on smaller ones. Our evaluation is based on approximately three million real-world satellite image download requests representing global user download behavior since October 2008.
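The three policies compared here differ only in how the eviction victim is chosen, which makes a trace-driven comparison straightforward. The sketch below is a generic simulator of that kind, not the paper's evaluation code; it replays a request trace and reports the hit ratio for whichever policy is selected.

```python
from collections import Counter, OrderedDict, deque

def simulate(trace, capacity, policy):
    """Replay a request trace against a cache holding `capacity` items
    and return the hit ratio under FIFO, LRU, or LFU eviction."""
    cache, hits = set(), 0
    fifo = deque()          # insertion order, for FIFO
    lru = OrderedDict()     # recency order, for LRU
    freq = Counter()        # full-history frequencies, for LFU
    for item in trace:
        freq[item] += 1
        if item in cache:
            hits += 1
            if policy == "LRU":
                lru.move_to_end(item)
            continue
        if len(cache) >= capacity:               # miss on a full cache
            if policy == "FIFO":
                victim = fifo.popleft()
            elif policy == "LRU":
                victim, _ = lru.popitem(last=False)
            else:                                # LFU
                victim = min(cache, key=lambda x: freq[x])
            cache.discard(victim)
            lru.pop(victim, None)
        cache.add(item)
        fifo.append(item)
        lru[item] = True
    return hits / len(trace)
```

Running something like `simulate(requests, 100_000, "LRU")` over a download log yields the kind of hit-ratio comparison the paper performs at petabyte scale.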

Accurate modeling of cache replacement policies in a data grid

20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST 2003), Proceedings, 2003

Caching techniques have been used to bridge the performance gap between levels of the storage hierarchy in computing systems. In data-intensive applications that access large data files over a wide-area network environment, such as a data grid, caching mechanisms can significantly improve data access performance under appropriate workloads. In a data grid, it is envisioned that local disk storage resources retain or cache the data files being used by local applications. Under a workload of shared accesses with high locality of reference, the performance of these caching techniques depends heavily on the replacement policies being used. A replacement policy determines which set of objects must be evicted when space is needed. Unlike cache replacement policies in virtual memory paging or database buffering, developing an optimal replacement policy for data grids is complicated by the fact that the file objects being cached have varying sizes and transfer and processing costs that vary with time. We present an accurate model for evaluating various replacement policies and propose a new replacement algorithm referred to as "Least Cost Beneficial based on K backward references (LCB-K)." Using this modeling technique, we compare LCB-K with various replacement policies such as Least Frequently Used (LFU), Least Recently Used (LRU), and GreedyDual-Size (GDS), using synthetic and actual workloads of accesses to and from tertiary storage systems. The results obtained show that LCB-K and GDS are the most cost-effective cache replacement policies for storage resource management in data grids.
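GreedyDual-Size, the strongest baseline here, is worth sketching because it directly addresses the complication the abstract names: objects with different sizes and retrieval costs. Each object carries a value H = L + cost/size; eviction removes the lowest-H object and raises the global inflation value L to that H, so unreferenced objects gradually age out. The sketch below is a standard lazy-heap implementation of GDS, not the paper's LCB-K algorithm.

```python
import heapq

class GreedyDualSize:
    """GreedyDual-Size: evict the object with the smallest H = L + cost/size,
    inflating L on each eviction so idle objects decay relative to new ones."""

    def __init__(self, capacity):
        self.capacity, self.used, self.L = capacity, 0, 0.0
        self.H, self.size = {}, {}
        self.heap = []                    # lazy min-heap of (H, obj) entries

    def access(self, obj, size, cost):
        """Record an access; returns True on a hit, False on a miss."""
        if obj in self.H:                 # hit: restore the object's H value
            self.H[obj] = self.L + cost / size
            heapq.heappush(self.heap, (self.H[obj], obj))
            return True
        while self.used + size > self.capacity and self.H:
            h, victim = heapq.heappop(self.heap)
            if self.H.get(victim) != h:   # stale heap entry: skip it
                continue
            self.L = h                    # inflate L to the evicted H value
            self.used -= self.size.pop(victim)
            del self.H[victim]
        self.H[obj] = self.L + cost / size
        self.size[obj] = size
        self.used += size
        heapq.heappush(self.heap, (self.H[obj], obj))
        return False
```

With uniform costs and sizes, GDS degenerates to LRU; the cost/size term is what lets it favor expensive-to-refetch tertiary-storage files, which is why it remains competitive with LCB-K in the paper's results.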

A Hierarchical Internet Object Cache

1996

This paper discusses the design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better. The design was motivated by our earlier trace-driven simulation study of Internet traffic. We challenge the conventional wisdom that the benefits of hierarchical file caching do not merit the costs, and believe the issue merits reconsideration in the Internet environment.

Client cache management in a distributed object database

1995

A distributed object database stores objects persistently at servers. Applications run on client machines, fetching objects into a client-side object cache. If fetching and cache management are done in terms of objects, rather than fixed-size units such as pages, three problems must be solved: 1. which objects to prefetch; 2. how to translate, or swizzle, inter-object references when they are fetched from server to client; and 3. which objects to displace from the cache. This thesis reports the results of experiments to test various solutions to these problems. The experiments use the runtime system of the Thor distributed object database and benchmarks adapted from the Wisconsin OO7 benchmark suite. The thesis establishes the following points: 1. For plausible workloads involving some amount of object fetching, the prefetching policy is likely to have more impact on performance than the swizzling policy or the cache management policy. 2. A simple breadth-first prefetcher can have performance tha...
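The breadth-first prefetcher named in the second finding is simple to state: when an object is fetched, eagerly fetch the objects it references, level by level, up to some budget. The sketch below only illustrates that traversal; `fetch`, `references`, and `budget` are assumed stand-ins, not Thor's runtime interfaces.

```python
from collections import deque

def bfs_prefetch(root_oid, fetch, references, budget):
    """Breadth-first prefetch sketch: starting from one fetched object,
    fetch the objects it references, then theirs, until `budget` objects
    are resident in the client cache."""
    fetched = {root_oid: fetch(root_oid)}    # oid -> object contents
    queue = deque([root_oid])
    while queue and len(fetched) < budget:
        oid = queue.popleft()
        for ref in references(fetched[oid]): # follow inter-object references
            if ref not in fetched and len(fetched) < budget:
                fetched[ref] = fetch(ref)    # one fetch per object here;
                queue.append(ref)            # real systems batch requests
    return fetched
```

The thesis's point is that getting this policy right matters more than how references are swizzled or which objects are displaced, because prefetching determines how many server round trips a workload pays at all.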

Aggregating Caches: A Mechanism for Implicit File Prefetching

Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2001

We introduce the aggregating cache and demonstrate how it can be used to reduce the number of file retrieval requests made by a caching client, improving storage system performance by reducing the impact of latency. The aggregating cache utilizes predetermined groupings of files to perform group retrievals. These groups are maintained by the server and built dynamically using observed inter-file relationships. Through a simple analytical model, we demonstrate how this mechanism has the potential to reduce average latencies by 75% to 82%. Through trace-based simulation, we demonstrate that a simple aggregating cache can reduce the number of demand fetches by almost 50%, while simultaneously improving cache hit ratios by up to 5%.
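The client-side read path is the whole mechanism: a miss on one file retrieves the entire server-defined group containing it, so subsequent reads of related files hit locally instead of paying another round trip. The sketch below assumes `group_of` and `fetch_group` as stand-ins for the server-maintained grouping and group-retrieval interfaces.

```python
class AggregatingCache:
    """Sketch of an aggregating cache client: one demand fetch per group,
    implicitly prefetching every other file in that group."""

    def __init__(self, group_of, fetch_group):
        self.group_of = group_of        # server-side mapping: file -> group id
        self.fetch_group = fetch_group  # retrieves {path: contents} for a group
        self.store = {}                 # local cache: path -> contents
        self.demand_fetches = 0

    def read(self, path):
        if path not in self.store:
            self.demand_fetches += 1    # one round trip brings the whole group
            self.store.update(self.fetch_group(self.group_of(path)))
        return self.store[path]
```

If files in a group are genuinely accessed together, each group retrieval converts several would-be demand fetches into local hits, which is the effect behind the reported near-50% reduction in demand fetches.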