Hive Interview Questions and Answers

  1. What is Hive?

Hive is an ETL and data warehousing tool built on top of the Hadoop Distributed File System (HDFS). It is a data warehouse framework for querying and analyzing data stored in HDFS, using a SQL-like language called HiveQL. Hive is open-source software that lets programmers analyze large data sets on Hadoop.

  1. Why use Hive?

  • When building data warehouse applications

  • When you are dealing with static data rather than dynamic data

  • When the application can tolerate high latency (long response times)

  • When a large data set is maintained

  • When you prefer writing queries to writing scripts

  1. What are the different modes of Hive?

Hive can operate in two modes, depending on the number of data nodes in Hadoop and the size of the data. These modes are:

  • Local mode

  • MapReduce mode
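
As a rough sketch of how the mode can be chosen for a session: the two properties below are standard Hadoop and Hive settings, but their defaults and behavior vary by version and execution engine, so treat the values as assumptions.

```sql
-- Run this session's jobs in a single local process instead of on the cluster.
SET mapreduce.framework.name=local;

-- Alternatively, let Hive decide: small inputs run locally, larger ones go to the cluster.
SET hive.exec.mode.local.auto=true;
```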

  1. What are the key components of the Hive architecture?

The key components of the Hive architecture are:

  • User Interface

  • Compiler

  • Metastore

  • Driver

  • Execution Engine

  1. What are the different types of tables available in Hive?

Hive has two types of tables:

  • Managed table: Both the data and the schema are under Hive's control.

  • External table: Only the schema is under Hive's control; the data is managed outside Hive.
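
A minimal sketch of the two table types in HiveQL. The table names, columns, and the `/data/employees` path are hypothetical and only illustrate the syntax.

```sql
-- Managed table: Hive owns both the schema and the files under its warehouse directory.
CREATE TABLE employees_managed (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: Hive owns only the schema; the files stay at the given location.
CREATE EXTERNAL TABLE employees_external (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/employees';
```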

  1. What is Metastore in Hive?

The Metastore is Hive's central repository. It stores schema information (metadata) in an external relational database.

  1. What are the different parts of Hive?

Hive consists of three main parts:

  • Hive Clients

  • Hive Services

  • Hive Storage and Computing

  1. Which databases does Hive support for metadata storage?

For single-user metadata storage, Hive uses the embedded Derby database. For multi-user or shared metadata, Hive uses an external database such as MySQL.

  1. What are the default read and write classes in Hive?

The default read and write classes in Hive are:

  • TextInputFormat/HiveIgnoreKeyTextOutputFormat

  • SequenceFileInputFormat/SequenceFileOutputFormat

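  1. Why is Hive not suitable for OLTP systems?

Hive is built for high-latency batch analysis and, by default, does not provide row-level insert and update operations. Newer versions (0.14 and later) add them for ACID tables, but they are not designed for frequent small transactions. Hive is therefore not suitable for OLTP systems.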

  1. What is the difference between HBase and Hive?

The common differences between HBase and Hive are:

  • Hive supports most SQL queries through HiveQL, whereas HBase has no native SQL interface

  • Hive does not support record-level insert, update, and delete operations on ordinary (non-ACID) tables, whereas HBase does

  • Hive is a data warehouse framework, whereas HBase is a NoSQL database

  • Hive runs on top of MapReduce, whereas HBase runs on top of HDFS

  1. When should you use HBase?

  • Data size is huge: When you have many millions of records to operate on

  • Complete redesign: When you move from an RDBMS to HBase, treat it as a complete redesign rather than a simple port

  • SQL-less commands: When you can live without RDBMS features such as transactions, inner joins, and typed columns

  • Infrastructure investment: When you have a large enough cluster for HBase to be really useful

  1. What is ObjectInspector functionality in Hive?

ObjectInspector is used in Hive to analyze the internal structure of columns, rows, and complex objects. It provides access to the internal fields of those objects.

  1. What is HiveServer2 (HS2)?

It is a server interface that performs the following functions:

  • It allows remote clients to execute queries against Hive

  • It retrieves the results of those queries

Its current version, based on Thrift RPC, adds advanced features such as:

  • Multi-client concurrency

  • Authentication

  1. What is Hive query processor?

The Hive query processor converts a query into a graph of MapReduce jobs, which the execution-time framework then runs in the order of their dependencies.
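
One way to inspect the resulting job graph is the EXPLAIN statement. A minimal sketch, assuming a hypothetical `employees` table with a `department` column:

```sql
-- EXPLAIN prints the plan without running the query.
-- The output lists the stages and the dependencies between them.
EXPLAIN
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department;
```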

  1. What are the different components of a Hive query processor?

The Hive query processor contains the following components,

  • Logical Plan Generation

  • Physical Plan Generation

  • Execution Engine

  • Operators

  • UDFs and UDAFs

  • Optimizer

  • Parser

  • Semantic Analyzer

  • Type Checking

  1. What are partitions in Hive?

Hive organizes tables into partitions.

  • Partitioning divides a table into parts based on the values of its partition keys.

  • Partitioning is helpful when the table has one or more partition keys.

  • Partition keys determine how the data is stored in the table.
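
A minimal sketch of a partitioned table. The table names, columns, and date value are hypothetical; `staging_page_views` is assumed to exist with matching columns.

```sql
CREATE TABLE page_views (
  user_id STRING,
  url     STRING
)
PARTITIONED BY (view_date STRING);

-- Each partition value gets its own subdirectory under the table's directory,
-- here named view_date=2024-01-15.
INSERT INTO TABLE page_views PARTITION (view_date = '2024-01-15')
SELECT user_id, url FROM staging_page_views;

-- Filtering on the partition key lets Hive read only the matching partition.
SELECT COUNT(*) FROM page_views WHERE view_date = '2024-01-15';
```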

  1. When should you choose an “Internal Table” in Hive?

Choose an internal (managed) table in the following situations:

  • If the data to be processed is available in the local file system

  • If Hive should manage the complete lifecycle of the data, including its deletion

  1. When should you choose an “External Table” in Hive?

Choose an external table in the following situations:

  • If the data to be processed is already available in HDFS

  • If the files are also used outside of Hive
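
The practical difference shows up when a table is dropped. A sketch, assuming a managed table named `employees_managed` and an external table named `employees_external` already exist (both names are hypothetical):

```sql
-- The output includes a "Table Type" row (MANAGED_TABLE or EXTERNAL_TABLE)
-- and the location of the data.
DESCRIBE FORMATTED employees_managed;

-- Dropping a managed table removes the metadata and the data files
-- (the files may go to the HDFS trash first, depending on configuration).
DROP TABLE employees_managed;

-- Dropping an external table removes only the metadata; the files stay in place.
DROP TABLE employees_external;
```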

  1. Can a view have the same name as a Hive table?

No. A view's name must be unique among all tables and views in the same database.

  1. What are views in Hive?

In Hive, views are similar to tables and are created as needed. A view is logical: it stores a query definition, not data.

  • Any SELECT query can be saved as a view in Hive

  • Usage is similar to views in SQL

  • Views are read-only: they can be queried like tables, but they cannot be the target of LOAD or INSERT statements
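
A minimal sketch of creating, querying, and dropping a view. The `employees` table, its columns, and the salary threshold are hypothetical.

```sql
-- Save a query under a name.
CREATE VIEW high_earners AS
SELECT id, name, salary
FROM employees
WHERE salary > 100000;

-- Query the view like a table.
SELECT name FROM high_earners;

-- Remove the view; the underlying table is untouched.
DROP VIEW high_earners;
```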

  1. What is Bucket in Hive?

A bucket is a subdivision of the data in a table or partition. Hive assigns rows to buckets based on the hash of a selected column in the table.

  1. How do you enable buckets in Hive?

The following command enables bucketing in Hive 0.x and 1.x. From Hive 2.0 onward bucketing is always enforced, so the setting is no longer needed.

SET hive.enforce.bucketing=true;
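
A sketch of a bucketed table. The table names, columns, and bucket count are hypothetical; `users_staging` is assumed to exist.

```sql
-- Needed on Hive 0.x and 1.x so that inserts honor the bucket definition.
SET hive.enforce.bucketing=true;

CREATE TABLE users_bucketed (
  user_id INT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

-- Rows are assigned to one of the 4 buckets by hashing user_id.
INSERT OVERWRITE TABLE users_bucketed
SELECT user_id, name FROM users_staging;
```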

  1. Can you override Hadoop MapReduce configuration in Hive?

Yes. Hadoop MapReduce configuration properties can be overridden in Hive, for example with the SET command for the current session.
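
For illustration, a session-level override might look like this. The reducer count of 8 is an arbitrary example value.

```sql
-- Ask for 8 reducers for the queries that follow in this session.
SET mapreduce.job.reduces=8;

-- With no value, SET prints the current setting.
SET mapreduce.job.reduces;
```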

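  1. How do you change a column's data type in Hive?

The following command changes a column's data type in Hive:

ALTER TABLE table_name CHANGE column_name column_name new_datatype;

The column name appears twice because CHANGE takes the old name followed by the new name. Repeating it keeps the name and changes only the type.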
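
A concrete instance of the command, assuming a hypothetical `employees` table whose `salary` column is currently INT:

```sql
-- Keep the column name (salary) and widen its type from INT to BIGINT.
ALTER TABLE employees CHANGE salary salary BIGINT;

-- Confirm the new type.
DESCRIBE employees;
```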

  1. What is the use of “order by” in Hive?

ORDER BY sorts all of the data together, so every row has to pass through a single reducer. The result is a total ordering.

  1. What is the use of “sort by” in Hive?

SORT BY sorts the data within each reducer. Multiple reducers can be used, so the output of each reducer is sorted but the overall result may not be.
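
The two clauses side by side, as a sketch against a hypothetical `employees` table. The last query adds DISTRIBUTE BY, which is commonly paired with SORT BY to control which reducer each row goes to.

```sql
-- Total ordering: all rows pass through one reducer.
SELECT name, salary FROM employees ORDER BY salary DESC;

-- Per-reducer ordering: each reducer's output is sorted, but the overall result may not be.
SELECT name, salary FROM employees SORT BY salary DESC;

-- DISTRIBUTE BY routes rows with the same department to the same reducer before sorting.
SELECT name, department, salary
FROM employees
DISTRIBUTE BY department
SORT BY department, salary DESC;
```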

  1. When do you use explode in Hive?

explode is a table-generating function that takes an array or map and emits one row per element. Use it to convert complex data types into ordinary table rows.
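
A short sketch of explode, first on a literal array and then with LATERAL VIEW. The `orders` table and its `items` array column are hypothetical.

```sql
-- One row per array element: returns three rows (1, 2, 3).
SELECT explode(array(1, 2, 3)) AS n;

-- LATERAL VIEW joins each exploded element back to its source row.
SELECT o.order_id, item
FROM orders o
LATERAL VIEW explode(o.items) exploded AS item;
```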

  1. How do you stop a partition from being queried?

A partition can be stopped from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
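
A sketch of the clause on a hypothetical partitioned table `page_views`. This protection feature belongs to older Hive releases and appears to have been removed in Hive 2.0, so check your version before relying on it.

```sql
-- Mark one partition offline so that queries against it fail.
ALTER TABLE page_views PARTITION (view_date = '2024-01-15') ENABLE OFFLINE;

-- Make the partition queryable again.
ALTER TABLE page_views PARTITION (view_date = '2024-01-15') DISABLE OFFLINE;
```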

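  1. Where is table data stored in Apache Hive by default?

By default, Hive stores table data in HDFS under the warehouse directory:

hdfs://namenode_server/user/hive/warehouse

This location is controlled by the hive.metastore.warehouse.dir property.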

  1. How do you read and write HDFS files in Hive?

i) TextInputFormat: This class is used to read data in plain text file format.

ii) HiveIgnoreKeyTextOutputFormat: This class is used to write data in plain text file format.

iii) SequenceFileInputFormat: This class is used to read data in Hadoop SequenceFile format.

iv) SequenceFileOutputFormat: This class is used to write data in Hadoop SequenceFile format.
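
In practice these classes are usually selected through the STORED AS clause rather than named directly. A sketch with hypothetical table names:

```sql
-- Plain text: reads with TextInputFormat, writes with HiveIgnoreKeyTextOutputFormat.
CREATE TABLE logs_text (line STRING)
STORED AS TEXTFILE;

-- The same choice written out with explicit class names.
CREATE TABLE logs_text_explicit (line STRING)
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

-- Hadoop SequenceFile storage.
CREATE TABLE logs_seq (line STRING)
STORED AS SEQUENCEFILE;
```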