Ethan Millar

5 years ago · 2 min. reading time · visibility ~100 ·


Explore Metadata In Kind Of Tables In Apache Hive With Hadoop Integration Experts

In this post, Hadoop integration professionals explain how to explore metadata for the different kinds of tables in Apache Hive. Read on to see how Hive treats metadata and data for external and internal tables.

Introduction:

Apache Hadoop is a framework for processing big data. Hive is a data warehouse built on top of Hadoop. It is powerful for querying big data because it maps metadata to the real data in the Hadoop Distributed File System (HDFS) and processes that data with MapReduce. In recent versions, Hive can also switch its execution engine to Spark or Tez. Hive supports complex data types, user-defined functions (UDFs), and a variety of built-in functions. I will cover UDFs in Hive in another blog.


Hive keeps all of its state in a relational database called the metastore, which is usually hosted on a master node (often the same machine as the NameNode). For example, when we create a table with the command "CREATE TABLE Student(id string) LOCATION 'hdfs://data/sample/';", the table schema is stored in that database as Hive metadata.

For a partitioned table, the partition information is also stored in this relational database, which lets Hive list the partitions and locate the data easily. All of this is called 'metadata'. Metadata includes the table format, the mapped location, the data files, and so on. It lives in the metastore database, separate from the data files in HDFS.

When we drop an internal table (the default kind, also called a managed table), Hive removes both the data and the metadata. However, when we drop an external table, Hive removes only the metadata, and the data stays in HDFS. Hive no longer knows about that data, but it does not touch the data itself.

This is very important when working with Hive and Hadoop. In my experience, I have seen many engineers and developers make this mistake and lose all the data in a data warehouse. I hope this blog helps you understand the metadata concept and the kinds of tables in Hive.
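One way to guard against that mistake is to check a table's type before dropping it. The sketch below assumes the hive CLI is on the PATH and uses the table name from the walkthrough later in this post. The names table_type and safe_to_drop are hypothetical helpers, not part of Hive. They read the output of DESCRIBE FORMATTED, whose "Table Type" row reports MANAGED_TABLE for an internal table and EXTERNAL_TABLE for an external one.

```shell
#!/bin/sh
# table_type: hypothetical helper that reads DESCRIBE FORMATTED output on
# stdin and prints MANAGED_TABLE or EXTERNAL_TABLE.
table_type() {
  grep 'Table Type' | grep -oE 'MANAGED_TABLE|EXTERNAL_TABLE' | head -n 1
}

# safe_to_drop: hypothetical helper that succeeds only when the table is
# external, so a DROP TABLE would leave the files in HDFS untouched.
safe_to_drop() {
  [ "$(table_type)" = "EXTERNAL_TABLE" ]
}

# Usage against a live cluster (assumes the hive CLI is installed):
#   hive -e "DESCRIBE FORMATTED mydatabase.sample" | table_type
#   hive -e "DESCRIBE FORMATTED mydatabase.sample" | safe_to_drop && echo "drop keeps the data"
```

Parsing the CLI output this way is only a convenience. Reading the DESCRIBE FORMATTED output by eye before running DROP TABLE gives the same information.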

Environment

Java: JDK 1.7

Cloudera version: CDH5.4.7, please refer to this link: http://www.cloudera.com/downloads/cdh/5-4-7.html

Initial steps

1. Prepare an input data file. Open a new local file with vi and add the following records:

vi file1

1;Jack

2;Ryan

3;Jean


2. Put the local file into the Hadoop Distributed File System (HDFS) with these commands:

hadoop fs -mkdir -p /data/mydata/sample

hadoop fs -put file1 /data/mydata/sample/
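Steps 1 and 2 can also be scripted instead of typed by hand. This is a minimal sketch: it writes the same three records with printf rather than vi, previews how the ';' delimiter splits each line, and wraps the two upload commands from step 2 in a function. The name upload_sample is chosen here for illustration, and nothing touches HDFS until you call it on a machine with the Hadoop client installed.

```shell
#!/bin/sh
# Write the sample records; each line is "<accountId>;<name>".
printf '%s\n' '1;Jack' '2;Ryan' '3;Jean' > file1

# Preview how the ';' delimiter splits each line into two fields,
# mirroring the accountId and name columns used later in this post.
awk -F';' '{ print "accountId=" $1 ", name=" $2 }' file1

# upload_sample: hypothetical wrapper around the commands from step 2.
upload_sample() {
  hadoop fs -mkdir -p /data/mydata/sample &&
  hadoop fs -put file1 /data/mydata/sample/
}
```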


Code walkthrough and verifying the result

This is the Hive script that creates and drops an external table and then a default (internal) table. It assumes that the database mydatabase already exists.


-- External table: dropping it removes only the metadata.
DROP TABLE IF EXISTS mydatabase.sample;

CREATE EXTERNAL TABLE mydatabase.sample
(
  accountId string,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';

-- Default (internal) table: dropping it removes the metadata and the data.
DROP TABLE IF EXISTS mydatabase.sample;

CREATE TABLE mydatabase.sample
(
  accountId string,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';


1. Check that the local file was uploaded to HDFS:

hadoop fs -ls /data/mydata/sample/

-> The listing should show file1 in /data/mydata/sample.


2. Open the Hive shell and run these commands:


DROP TABLE IF EXISTS mydatabase.sample;

CREATE EXTERNAL TABLE mydatabase.sample
(
  accountId string,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';


3. Check that the table was created:

show create table mydatabase.sample;

-> It should show the structure of the sample table.


4. Drop the external table:

drop table mydatabase.sample;


5. Repeat step 3 and see that the table no longer exists.


6. Now check the data in HDFS to see whether Hive deleted only the metadata or both the metadata and the data.

hadoop fs -ls /data/mydata/sample/

-> The data is still there. Dropping an external table deletes only the metadata.


7. Now run these commands to create a default (internal) Hive table:


DROP TABLE IF EXISTS mydatabase.sample;

CREATE TABLE mydatabase.sample
(
  accountId string,
  name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';


8. Repeat steps 3, 4, 5, and 6 to verify how Hive handles the metadata and the actual data in HDFS.

hadoop fs -ls /data/mydata/sample/

-> The data is gone. Dropping an internal table deletes both the metadata and the actual data.
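The checks in steps 6 and 8 can also be expressed as an exit-status test rather than read from a listing. The sketch below relies on "hadoop fs -test -e", which exits with status 0 when the path exists. The name report_data is a hypothetical helper, and the file path is the one used in this walkthrough.

```shell
#!/bin/sh
# report_data: hypothetical helper; prints whether a path still exists in HDFS.
report_data() {
  if hadoop fs -test -e "$1"; then
    echo "data still present: $1"
  else
    echo "data removed: $1"
  fi
}

# Usage after each DROP TABLE (requires the Hadoop client on the PATH):
#   report_data /data/mydata/sample/file1
```

After dropping the external table, the helper should report that the data is still present. After dropping the internal table, it should report that the data was removed.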


Later steps such as loading data, indexing, and creating views work the same way for external and internal tables. I hope this helps you understand how Hive works with each kind of table.


Hadoop integration professionals wrote this article to help readers learn how to explore metadata for the different kinds of tables in Apache Hive. You can share your thoughts on this post with other readers.