
Hive

What is Hive?

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Hive is built on top of Apache Hadoop and provides:

- Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
- Query execution via Apache Tez™, Apache Spark™, or MapReduce.

Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. 
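As an illustration of the kind of analytic SQL meant here, the sketch below runs a window function (`RANK() OVER`, introduced in SQL:2003). It uses Python's built-in `sqlite3` module purely as a stand-in engine so the example runs without a cluster; it does not connect to Hive, it assumes the bundled SQLite is version 3.25 or later, and the `sales` table and its rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 300), ("west", 200), ("west", 50)],
)

# Rank each sale within its region by amount, largest first.
query = """
SELECT region,
       amount,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
FROM sales
ORDER BY region, rank_in_region
"""
rows = conn.execute(query).fetchall()

for region, amount, rank_in_region in rows:
    print(region, amount, rank_in_region)
```

The `SELECT` statement uses only standard windowing syntax, so a query of the same shape should be expressible in Hive, which also supports windowing and analytic functions.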

There is no single "Hive format" in which data must be stored. Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats. Users can extend Hive with connectors for other formats.
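The idea of pluggable format connectors can be sketched in a few lines. This is a toy model, not Hive's connector API: the `READERS` registry and the `register_format` and `load_rows` helpers are hypothetical names invented for the example, and the "formats" are simple delimited text handled with Python's `csv` module.

```python
import csv
import io

def read_delimited(text, delimiter):
    """Parse delimited text into a list of row tuples."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), delimiter=delimiter)]

# Hypothetical registry: format name -> function that turns raw text into rows.
READERS = {
    "csv": lambda text: read_delimited(text, ","),
    "tsv": lambda text: read_delimited(text, "\t"),
}

def register_format(name, reader):
    """Add a connector for a format the built-in set does not cover."""
    READERS[name] = reader

def load_rows(fmt, text):
    """Look up the connector for fmt and use it to read the data."""
    return READERS[fmt](text)

# The same logical table stored in two built-in formats.
csv_rows = load_rows("csv", "1,alice\n2,bob\n")
tsv_rows = load_rows("tsv", "1\talice\n2\tbob\n")

# A user-supplied connector for pipe-separated text.
register_format("psv", lambda text: read_delimited(text, "|"))
psv_rows = load_rows("psv", "1|alice\n2|bob\n")

print(csv_rows == tsv_rows == psv_rows)
```

Whatever the storage format, the caller sees the same rows, which is the property the paragraph above describes: queries are written against tables, and the connector handles the on-disk representation.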

Hive is not designed for online transaction processing (OLTP) workloads. It is best used for traditional data warehousing tasks.


Architecture




As shown in the architecture figure, the main components of Hive are:

- UI – The user interface through which users submit queries and other operations to the system. As of 2011, the system had a command-line interface, and a web-based GUI was under development.
- Driver – The component that receives the queries. It implements the notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.
- Compiler – The component that parses the query, performs semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan using the table and partition metadata looked up from the metastore.
- Metastore – The component that stores all the structural information about the tables and partitions in the warehouse. This includes column names and column types, the serializers and deserializers needed to read and write data, and the HDFS files where the data is stored.
- Execution Engine – The component that executes the plan created by the compiler. The plan is a DAG of stages. The execution engine manages the dependencies between these stages and executes each stage on the appropriate system component.
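The execution engine's job of running a DAG of stages in dependency order can be sketched with Python's standard `graphlib` module (Python 3.9+). This is a minimal illustration, not Hive's implementation: the stage names, the `plan` dictionary, and the `run_plan` helper are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

def run_plan(plan, run_stage):
    """Run every stage once all of its dependencies have finished."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    completed = []
    while sorter.is_active():
        # Every stage returned here has all of its dependencies satisfied,
        # so a real engine could dispatch the whole batch in parallel.
        for stage in sorted(sorter.get_ready()):
            run_stage(stage)
            completed.append(stage)
            sorter.done(stage)  # unblocks stages that depend on this one
    return completed

order = run_plan(plan, lambda stage: print("running", stage))
```

Here the two scan stages become ready first, the join runs only after both have completed, and the remaining stages follow in sequence. The sketch runs stages one at a time for simplicity.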

Hive vs. Presto

Hive is optimized for throughput, while Presto is optimized for latency. Hive translates SQL queries into batch jobs that run on MapReduce, Tez, or Spark, whereas Presto executes distributed SQL queries in memory.
