Apache Hive

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides data summarization, ad-hoc query, and analysis of large datasets. It projects structure onto the data in Hadoop and queries that data using an SQL-like language called HiveQL (HQL).

Tables in Hive are similar to tables in a relational database.Data can be accessed via a simple query language and Hive supports overwriting or appending data. Within any particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory.

Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.

Hive also allows programmers familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.

Hive users have a choice of 3 runtimes when executing SQL queries. Users can choose between Apache Hadoop MapReduce, Apache Tez or Apache Spark frameworks as their execution backend

SQL in Hive

Submitting Queries

Beeline is the CLI for HiveServer2. Beeline is started with JDBC URL of the HiveServer2 GUI Tools

Ambari Hive View Zeppelin DBVisualizer SQLWorkBench

HiveServer2

HiveServer2 is built on Thrift (one of the cross-language serialization/RPC framework which support communication between variety of programming languages) Several bug fixes and concurrency and authentication issues are fixed

Performing Queries in Hive - MapReduce under the cover

Defining a Table

Defining External Table

An external table is a table for which Hive does not manage storage. If you drop an external table, only the definition in Hive is deleted. The data remains. An internal table is a table that Hive manages. If you drop an internal table, both the definition in Hive and the data are deleted.

Location specification

If a LOCATION is not supplied, the table’s data will reside in /apps/hive/warehouse/db_name/table_name

Loading from different file systems

Ambari Hive View

Performance Improvements

Both Stinger and Tez have increased the performance of Hive queries.

The Stinger Initiative enables Hive to support beyond its Batch roots to support interactive queries – all with a common SQL access layer.

Tez improves the MapReduce paradigm by dramatically improving its speed, while maintaining MapReduce’s ability to scale to petabytes of data.

Hive has always been the defacto standard for SQL in Hadoop and these advances are helping production deployment of Hive across a much wider array of scenarios.

HCatalog

Apache™ HCatalog is a table management layer that exposes Hive metadata to other Hadoop applications. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS). It also provides REST APIs so that external systems can access these tables’ metadata.

HIVE Overview

Hive is the data warehouse system for Hadoop and uses the familiar table and SQL metaphors that are used with classic RDBMS solutions
Hive can create, populate and query tables
Views are supported, but are not materialized

References

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/ch_using-hive.html

Apache Hive

Apache Hive

SQL in Hive

Submitting Queries

HiveServer2

Performing Queries in Hive - MapReduce under the cover

Defining a Table

Defining External Table

Location specification

Loading from different file systems

Ambari Hive View

Performance Improvements

HCatalog

HIVE Overview

References

results matching ""

No results matching ""