
INTRODUCTION TO APACHE HIVE - HADOOP TUTORIALS

What is Hive?
 Hive provides
 Hive does NOT provide
Hive versus Java and Pig
 Word Count Using Hive
SQL vs Hive
Hive vs Pig
Hadoop MapReduce vs Pig vs Hive
Hive Components
Hive Directory Structure
Hive Services
 CLI
 HiveServer
 HWI
 Metastore
Hive Metastore
Metastore Configuration
 Embedded
 Local
 Remote
Hive Clients
Hive Concepts

What is HIVE?

Hive is a data warehouse system built on top of Hadoop.
Hive provides a SQL-like interface, known as HiveQL (HQL for short), which allows for easy querying of data in Hadoop.
HQL has its own Data Definition and Data Manipulation Languages, which are very similar to SQL's DDL and DML.
Hive is not a full database.
Early Hive development started at Facebook in 2007.
Today Hive is an Apache project under Hadoop.
                – http://hive.apache.org
DDL: CREATE TABLE, CREATE INDEX, CREATE VIEW.
DML: SELECT, WHERE, GROUP BY, JOIN, ORDER BY.
Pluggable functions:
     UDF: User Defined Function
     UDAF: User Defined Aggregate Function
     UDTF: User Defined Table-generating Function
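
As a sketch of how a pluggable function is wired in from HiveQL (the jar path, class name, function name, and table below are all hypothetical):

```sql
-- Register a custom UDF from a jar and use it in a query
-- (jar path, class, function, and table names are hypothetical)
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.Lower';
SELECT my_lower(body) FROM some_table;
```

The same pattern registers UDAFs and UDTFs; only the Java base class the function extends differs.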

Hive provides

  • The ability to bring structure to various data formats.
  • A simple interface for ad hoc querying, analyzing, and summarizing large amounts of data.
  • Access to files on various data stores, such as HDFS and HBase.

Hive does NOT provide

  • Real-time processing: Hive is best suited for batch jobs and huge datasets. Think heavy analytics and large aggregations. Latencies are often much higher than in a traditional database system. Hive is schema on read, which provides fast loads and flexibility at the sacrifice of query time.
  • Full SQL support: Hive does not provide row-level inserts, updates, or deletes.
  • Transactions.
  • More than limited subquery support.
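
A minimal HiveQL sketch of the schema-on-read trade-off mentioned above (table name and path are hypothetical):

```sql
-- Loading is essentially a file move: fast, with no validation against the schema
CREATE TABLE raw_logs (line STRING);
LOAD DATA INPATH '/data/raw/logs.txt' INTO TABLE raw_logs;
-- The schema is only applied here, when the data is read
SELECT line FROM raw_logs LIMIT 10;
```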

Hive applications include:

  • Data mining
  • Document indexing
  • Predictive modeling and hypothesis testing
  • Customer-facing business intelligence (e.g., Google Analytics)
  • Log processing

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates.

Hive versus Java and Pig

Java
 - Word Count MapReduce example.
 - The Word Count program reads in documents on Hadoop and returns a listing of all the words read, along with the number of occurrences of each word.
 - Writing a custom MapReduce program to do this takes 63 lines of Java code. Hive performs the same task in only 7 lines of code!

Pig
 - Another Hive alternative is Apache Pig.
 - Pig is a high-level programming language, best described as a "data flow language" rather than a query language.
 - Pig has powerful data transformation capabilities and is great for ETL.
 - It is not so good for ad hoc querying.
 - Pig is a nice complement to Hive, and the two are often used in tandem in a Hadoop environment.

Word Count Using Hive

Step 1: Input file (wordcount.txt)



Hi Team ,2007-10-14
Please find some of few easier analysis you can do with the existing data set,2007-10-15
Find list of Reddits and Subreddits combination famous for each month / Year / Overall,2007-10-15
Most active user for each Reddits   of variable date range,2007-10-15
Find some of the spammers in each Reddit,2007-10-15
Spammer of the month,2007-10-15
when some of the following REddit or Sub reddit started,2007-10-15
BigData,2007-10-16
Datascience,2007-10-16
opendata,2007-10-16
add few what ever you like,2007-10-16
Who is the oldest Active user in Reddit,2007-10-16
most used words in Reddit for variable date range,2007-10-16
variable date range   - Year / Month / Week / Day / Hours( if avaliable),2007-10-16
me,2007-10-17

Step 2: Upload the file to Hortonworks via the file browser

Click Upload, then check that wordcount.txt appears in the file browser.

Step 3: Create the table wordcounttable

Click the Create Table button.

Step 4: Execute the word count query



SELECT word, COUNT(*)
FROM wordcounttable LATERAL VIEW explode(split(body, ' ')) wordcount AS word
GROUP BY word
ORDER BY word;

Explain

1. split(body, ' ') converts each line of data into words.
2. After the split, the words look like an array of strings.
3. explode converts the array of strings into multiple rows, one per word.
4. We apply GROUP BY to count word occurrences.
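
The intermediate results of these steps can be sketched against the sample table (the outputs shown in comments are illustrative, not exact):

```sql
-- Steps 1-2: split turns each body into an array of strings
SELECT split(body, ' ') FROM wordcounttable LIMIT 1;
-- e.g. ["Hi","Team"]

-- Step 3: explode turns that array into one row per word
SELECT explode(split(body, ' ')) AS word FROM wordcounttable LIMIT 2;
-- e.g. Hi
--      Team
```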

Step 5: Output

(Screenshot: Hive query result.)

SQL vs Hive

SQL databases offer full transactional support, row-level inserts, updates, and deletes, and low-latency queries; Hive trades these away for batch analytics over very large datasets, using schema on read rather than schema on write.

Hive vs Pig

Hive exposes a declarative, SQL-like query language well suited to ad hoc querying, while Pig exposes a procedural data-flow language with strong data-transformation capabilities, well suited to ETL. The two are complementary and often used in tandem.

Hadoop MapReduce vs Pig vs Hive

Hand-written MapReduce in Java gives the most control but requires the most code (63 lines for word count); Pig and Hive trade some of that control for far less code (about 7 lines in Hive) and much faster development.

Hive Components

  • Apache Hive helps analyze large datasets using the HiveQL query language against data sources such as HDFS or HBase.
  • The architecture is divided into a MapReduce-oriented execution layer, a metastore holding metadata about the stored data, and a driver layer that receives queries from users or applications for execution.



Metastore: stores the system catalog.
Driver: manages the life cycle of a HiveQL query as it moves through Hive; also manages the session handle and session statistics.
Query compiler: compiles HiveQL into a directed acyclic graph of map/reduce tasks.
Execution engine: executes the tasks in proper dependency order; interacts with Hadoop.
HiveServer: provides a Thrift interface and JDBC/ODBC for integrating other applications.
Client components: CLI, web interface, and JDBC/ODBC interface.
Extensibility interfaces: SerDe, User Defined Functions, and User Defined Aggregate Functions.

HIVE Directory structure

A typical Hive installation has the following directory structure:

lib directory ($HIVE_HOME/lib)
     - The lib folder contains a variety of JAR files. These JAR files contain the Java code that collectively makes up the functionality of Hive.
bin directory ($HIVE_HOME/bin)
     - The location of a variety of scripts that launch various Hive services.
conf directory ($HIVE_HOME/conf)
     - Contains Hive's configuration files.

Hive Services

cli: The command-line interface to Hive (the shell). This is the default service.
hiveserver: Runs Hive as a server exposing a Thrift service, enabling access from a range of clients written in different languages. Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server will listen on (defaults to 10000).
hwi: The Hive Web Interface.
jar: The Hive equivalent of hadoop jar, a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.

Hive Metastore

• To support features like schema(s) and data partitioning Hive keeps its metadata in a Relational Database
           – Packaged with Derby, a lightweight embedded SQL DB
• Default Derby based is good for evaluation an testing
• Schema is not shared between users as each user has their own instance of embedded Derby
• Stored in metastore_db directory which resides in the directory that hive was started from
          – Can easily switch another SQL installation such as MySQL 

Metastore configuration

               The metastore stores the Hive metadata. It consists of two pieces: the service and the data store. There are three configurations you can choose for your metastore.
Embedded
 - Runs the metastore code in the same process as your Hive program, with the database that backs the metastore in the same process as well.
 - The embedded metastore is likely to be used only in a test environment.
Local
 - Keeps the metastore code running in process, but moves the database into a separate process that the metastore code communicates with.
Remote
 - Moves the metastore service itself out of process as well.
 - Can be useful if you wish to share the metastore with other users.
 - The configuration you are most likely to use in a production environment, as it provides some additional security benefits on top of what's possible with a local metastore.
      A minimum Hive configuration identifies where the metastore is located. If the user provides no configuration details, an embedded Derby database is used.
      A Derby metastore only allows one user at a time, so it may be advantageous to set up Hive to use a more robust database, such as DB2, MySQL, or another JDBC-compliant database.
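
A sketch of the hive-site.xml properties that point the metastore at MySQL instead of embedded Derby (host, database name, and credentials are hypothetical):

```xml
<!-- hive-site.xml (sketch): back the metastore with MySQL -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>
```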

Hive Clients

If you run Hive as a server, there are a number of different mechanisms for connecting to it from applications:

Thrift Client: Makes it easy to run Hive commands from a wide range of programming languages. Thrift bindings for Hive are available for C++, Java, PHP, Python, and Ruby.
JDBC Driver: Hive provides a Type 4 (pure Java) JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Driver: The Hive ODBC driver allows applications that support the ODBC protocol to connect to Hive. It is still in development, so you should refer to the latest instructions on the Hive wiki.

Hive Concepts

Re-used from relational databases:
– Database: a set of tables, used for name-conflict resolution
– Table: a set of rows that have the same schema (the same columns)
– Row: a single record; a set of columns
– Column: provides a value and type for a single value
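
These concepts map directly onto HiveQL DDL; a sketch with hypothetical names:

```sql
CREATE DATABASE analytics;                 -- a database: a set of tables, resolving name conflicts
USE analytics;
CREATE TABLE users (id INT, name STRING);  -- a table: a set of rows sharing one schema
SELECT id, name FROM users;                -- each row is a record; each column holds one typed value
```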