Hive - A Warehousing Solution Over a Map-Reduce Framework

Hive - a petabyte scale data warehouse using Hadoop

2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [1] is a popular open-source map-reduce implementation which is being used in companies such as Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog, the Metastore, that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. At Facebook, the Hive warehouse contains tens of thousands of tables, stores over 700TB of data, and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month.
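
The idea of compiling a declarative query into map-reduce stages can be illustrated with a minimal sketch. This is not Hive's planner or any Hadoop API; it is a pure-Python simulation, under the assumption of a hypothetical `emp` table with a `dept` column, of how a query such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` could map onto a map phase, a shuffle and a reduce phase. All function and variable names are illustrative.

```python
from collections import defaultdict

# Hypothetical input rows; table and column names are illustrative only.
ROWS = [
    {"name": "a", "dept": "eng"},
    {"name": "b", "dept": "ops"},
    {"name": "c", "dept": "eng"},
]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1

def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, which implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}

def group_by_count(rows):
    """Run the three stages in order, like a single map-reduce job."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling `group_by_count(ROWS)` runs the three stages in sequence. Writing these stages by hand for every query is the maintenance burden that a declarative layer is meant to remove.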

Major technical advancements in apache hive

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014

Apache Hive is a widely used data warehouse system for Apache Hadoop, and has been adopted by many organizations for various big data analytics applications. Working closely with many users and organizations, we have identified several shortcomings of Hive in its file formats, query planning, and query execution, which are key factors determining the performance of Hive. In order to make Hive continuously satisfy the requests and requirements of processing increasingly high volumes of data in a scalable and efficient way, we have set two goals related to storage and runtime performance in our efforts on advancing Hive. First, we aim to maximize the effective storage capacity and to accelerate data accesses to the data warehouse by updating the existing file formats. Second, we aim to significantly improve cluster resource utilization and runtime performance of Hive by developing a highly optimized query planner and a highly efficient query execution engine. In this paper, we present a community-based effort on technical advancements in Hive. Our performance evaluation shows that these advancements provide significant improvements in storage efficiency and query execution performance. This paper also shows how academic research lays a foundation for Hive to improve its daily operations.

Apache Hive

Proceedings of the 2019 International Conference on Management of Data, 2019

Apache Hive is an open-source relational database system for analytic big-data workloads. In this paper we describe the key innovations on the journey from batch tool to fully fledged enterprise data warehousing system. We present a hybrid architecture that combines traditional MPP techniques with more recent big data and cloud concepts to achieve the scale and performance required by today's analytic applications. We explore the system by detailing enhancements along four main axes: transactions, optimizer, runtime, and federation. We then provide experimental results to demonstrate the performance of the system for typical workloads and conclude with a look at the community roadmap.

Techniques for Improving Apache Hive Performance Using Relational Data

2021

Hadoop is an open-source map-reduce implementation for storing and manipulating enormous data sets that has been widely adopted. End-users, on the other hand, can find Hadoop challenging to use, especially if they are unfamiliar with the map-reduce approach. Users must write map-reduce programs even for basic tasks like computing raw counts or averages. Apache Hive, a Hadoop data warehouse platform for processing structured data, allows users to quickly query, summarise, and interpret Big Data using HiveQL, which is a SQL-like language. It can import and export data from and to the storage file system in a variety of file formats. Hive's goal is to make processing simple and effective when petabytes of data are involved. Unlike an RDBMS, Hive stores data in a document-based format, so JOIN queries reduce output and use a lot of resources. However, by correctly configuring Hive, you can boost efficiency for relational data. In this study, we use a variety of optimizatio...

Importance of Data Distribution on Hive-Based Systems for Query Performance: An Experimental Study

2020 IEEE International Conference on Big Data and Smart Computing (BigComp)

SQL-on-Hadoop systems have been gaining popularity in recent years. One popular example is Apache Hive, the pioneer of SQL-on-Hadoop systems. Hive sits at the top of the big data stack as an application layer. Besides the application layer, the Hadoop ecosystem is composed of three main layers: storage, the resource manager and the processing engine. Demand from industry has led to the development of new, efficient components for each layer. As the ecosystem has evolved over time, Hive has employed different execution engines too. Understanding the strengths of the components is very important in order to exploit the full performance of the Hadoop ecosystem. Therefore, recent works in the literature study the importance of each layer separately. To the best of our knowledge, the present work is the first that focuses on the performance of the combination of the storage layer and the execution engine. In this work, we compare Hive's query performance using three different execution engines (MR, Tez and Spark) on skewed and well-balanced data distributions through the full TPC-H benchmark. Our results show the importance of data distribution on the storage layer for the overall job performance of SQL-on-Hadoop systems and empirically show that an even distribution improves performance by up to 48% compared to a skewed distribution. Moreover, the present study provides insightful findings by identifying particular SQL query cases that a certain processing engine handles exceptionally well.
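
One way to see why distribution can matter is a toy cost model. The sketch below is not the paper's methodology and does not reproduce its 48% figure; it assumes, hypothetically, one worker per partition and a cost proportional to the number of rows, so a parallel stage ends only when its largest partition is done. The function names and the sample partition sizes are illustrative.

```python
def job_time(partition_sizes, rows_per_second=1.0):
    """A parallel stage finishes when its largest partition finishes."""
    return max(partition_sizes) / rows_per_second

def skew_ratio(partition_sizes):
    """Largest partition relative to the mean; 1.0 means perfectly balanced."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

# Two hypothetical layouts of the same 100 rows over four partitions.
BALANCED = [25, 25, 25, 25]
SKEWED = [70, 10, 10, 10]
```

Under these assumptions both layouts hold the same total amount of data, yet the skewed layout takes as long as its 70-row partition, while the balanced layout takes as long as a 25-row partition.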

CloudETL: Scalable Dimensional ETL for Hive

Extract-Transform-Load (ETL) programs process data into data warehouses (DWs). Rapidly growing data volumes demand systems that scale out. Recently, much attention has been given to MapReduce for parallel handling of massive data sets in cloud environments. Hive is the most widely used RDBMS-like system for DWs on MapReduce and provides scalable analytics. It is, however, challenging to do proper dimensional ETL processing with Hive; e.g., the concept of slowly changing dimensions (SCDs) is not supported (and because UPDATEs are not supported, SCDs are complex to handle manually). The powerful Pig platform for data processing on MapReduce does not support such dimensional ETL processing either. To remedy this, we present the ETL framework CloudETL, which uses Hadoop to parallelize ETL execution and to process data into Hive. The user defines the ETL process by means of high-level constructs and transformations and does not have to worry about technical MapReduce details. CloudETL supports different dimensional concepts such as star schemas and SCDs. We present how CloudETL works and how it uses different performance optimizations, including a purpose-specific data placement policy to co-locate data. Further, we present a performance study and compare with other cloud-enabled systems. The results show that CloudETL scales very well and outperforms the dimensional ETL capabilities of Hive with respect to both performance and programmer productivity. For example, in one experiment Hive takes 3.9 times as long to load an SCD and needs 112 statements, while CloudETL needs only 4.
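
For readers unfamiliar with slowly changing dimensions, the following minimal sketch shows the common "type 2" variant, in which a changed record is closed and a new version is appended. It is not CloudETL's API or algorithm. The row layout (`key`, `attrs`, `valid_from`, `valid_to`) and the function name `apply_scd2` are assumptions made for illustration. The function builds a new table instead of updating rows in place, which loosely mirrors a store without UPDATE support.

```python
from datetime import date

def apply_scd2(dimension, incoming, load_date):
    """Return a new dimension table with type-2 history applied.

    dimension: list of dicts with keys key, attrs, valid_from, valid_to
               (valid_to is None for the current version of a key).
    incoming:  dict mapping business key -> attrs dict from the source.
    """
    result = []
    current_keys = set()
    for row in dimension:
        is_current = row["valid_to"] is None
        new_attrs = incoming.get(row["key"])
        if is_current and new_attrs is not None and new_attrs != row["attrs"]:
            # Close the old version and append a new current version.
            result.append({**row, "valid_to": load_date})
            result.append({"key": row["key"], "attrs": new_attrs,
                           "valid_from": load_date, "valid_to": None})
        else:
            result.append(row)
        if is_current:
            current_keys.add(row["key"])
    # Keys without a current version become new current rows.
    for key, attrs in incoming.items():
        if key not in current_keys:
            result.append({"key": key, "attrs": attrs,
                           "valid_from": load_date, "valid_to": None})
    return result

# Hypothetical example data: key 1 changes city, key 2 is new.
DIM = [{"key": 1, "attrs": {"city": "A"},
        "valid_from": date(2020, 1, 1), "valid_to": None}]
UPDATED = apply_scd2(DIM, {1: {"city": "B"}, 2: {"city": "C"}}, date(2021, 1, 1))
```

Even this small case needs version bookkeeping for every key, which hints at why expressing the same logic as a sequence of plain statements can become verbose.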

ANALYZING VARIETIES OF STRUCTURED, SEMI-STRUCTURED AND UNSTRUCTURED DATA USING HIVEQL

Nowadays, many organizations are focusing on gathering and analyzing data. Analysis of data is necessary in today’s world. The source of data is not fixed and can take different forms: it can be structured, like spreadsheets; semi-structured, like XML; or unstructured, like web pages, documents or text. Analysis of this kind of diversified data is necessary. The Hive Query Language can be used to analyze these varieties of data. This study presents how Hive can be used to analyze diversified data.

Comparison of SQL with HiveQL

2014

SQL is a set-based, keyword-based declarative programming language for accessing and manipulating database systems, not an imperative programming language like C or BASIC. This research paper covers the basic concepts of SQL with its advantages, disadvantages and architecture, and introduces Apache Hive with its features, advantages, disadvantages and architecture. The paper also introduces HiveQL and compares SQL with HiveQL.

An Overview of Apache Pig and Apache Hive

International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2019

With the advance of technology, data has been growing at an alarming rate. The most prominent driver of data growth is social media, which has led to a tremendous amount of data known as Big Data. Big Data is a term used for data sets that are extremely large as well as complicated to store and process using traditional database processing applications. Hadoop is a solution for dealing with Big Data; its two major components are HDFS (distributed storage) and MapReduce (parallel processing). Apache Pig and Hive are essential parts of the Hadoop ecosystem. This paper gives an overview of both Apache Pig and Hive along with their architectures. Although Hadoop does a great deal of work in storing and processing huge volumes of data, there are now additional frameworks that increase its efficiency, which are generally seen as layers of Hadoop or as parts of the Apache Hadoop project. This paper therefore covers the two most important of these layers, Apache Pig and Apache Hive.

Optimization of Multiple Correlated Queries by Detecting Similar Data Source with Hadoop/Hive

Indian Journal of Science and Technology

Objectives: To generate a single new Hive query (HiveQL) from two or more input queries by identifying similar operations and a common data source, and to compare the total execution times. Methods/Statistical Analysis: This paper uses the map-reduce model of Hadoop and Hive. A single new query is generated from two or more input queries, and three data samples of 2, 5 and 10 GB are generated using the free database generation tool DBGEN. TPC-H queries are executed on this data, and the total execution times are compared to assess performance. Findings: Hive executes a single query at a time; in this research, multiple queries are submitted to Hive by converting them into a single query. This approach reduces the number of operations performed while executing the query, which in turn reduces execution time and improves the performance of Hive. Hive processes the structured data of a data warehouse system, so with this approach structured data can be processed and analyzed easily and conveniently. Structured data is used for OLAP (Online Analytical Processing) queries, so Hive also helps to process OLAP queries. Hive works in conjunction with Hadoop and executes queries on data stored in Hadoop, so Hadoop must be running on the system before Hive queries can be used. This research requires a large amount of test data, so sample data is generated using DBGEN, the free data generation tool provided by the TPC (Transaction Processing Performance Council). The TPC also provides different types of queries for testing the performance of query execution tools, so TPC-H queries are used in this research. Application/Improvements: Using the concept shown in this research, the total execution time of Hive queries can be reduced drastically and the performance of Hive can be increased.
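
The benefit of merging queries that share a data source can be sketched with a toy model. This is not the paper's query-rewriting procedure and it involves no Hive or Hadoop calls; it assumes, for illustration only, that each query is a filtered sum over the same in-memory rows, and it counts how many rows are read when the queries run one by one versus in a single shared scan. The function names and the query representation are hypothetical.

```python
def run_separately(rows, queries):
    """Scan the source once per query; returns (results, rows_read)."""
    results, rows_read = {}, 0
    for name, (predicate, column) in queries.items():
        total = 0
        for row in rows:
            rows_read += 1
            if predicate(row):
                total += row[column]
        results[name] = total
    return results, rows_read

def run_merged(rows, queries):
    """Scan the source once and evaluate every query on each row."""
    results = {name: 0 for name in queries}
    rows_read = 0
    for row in rows:
        rows_read += 1
        for name, (predicate, column) in queries.items():
            if predicate(row):
                results[name] += row[column]
    return results, rows_read

# Hypothetical rows and two queries over the same source.
ROWS = [{"qty": 5, "price": 10}, {"qty": 30, "price": 20}, {"qty": 12, "price": 7}]
QUERIES = {
    "small_orders": (lambda r: r["qty"] < 20, "price"),
    "large_orders": (lambda r: r["qty"] >= 20, "price"),
}
```

Both functions return the same results, but `run_merged` reads each row once regardless of the number of queries, whereas `run_separately` reads the source once per query.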