Ethan Millar


Explore Metadata In The Kinds Of Tables In Apache Hive With Hadoop Integration Experts

In this post, Hadoop integration professionals explain how to explore metadata in the different kinds of tables in Apache Hive. Read on to see how Hadoop professionals work with metadata in Hive.

Introduction:

Apache Hadoop is a data framework that supports processing big data. Hive is a data warehouse built on top of Hadoop. Hive is very powerful for querying big data because it maps metadata onto the real data in the Hadoop Distributed File System (HDFS) and processes queries with MapReduce. In recent versions, Hive can also switch its execution engine to Spark or Tez. Hive additionally supports complex data types through UDFs and a variety of built-in functions; I will cover UDFs in Hive in another blog post.
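For example, in versions of Hive that support it, the execution engine can be switched per session. A minimal sketch (engine availability depends on what is installed on your cluster):

-- Show the current engine (older versions default to MapReduce, i.e. 'mr')
SET hive.execution.engine;

-- Switch this session to Tez, if the Tez engine is available
SET hive.execution.engine=tez;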

[Figure: Hadoop cluster (HDFS + MapReduce) showing the Name Node and Job Tracker]

Hive keeps a relational database, called the metastore, typically hosted on the master node (the name node in this setup), to store all of Hive's state. For example, when we create a table with the command "CREATE TABLE Student(id string) LOCATION '/data/sample/';", the table schema is stored in that database as Hive metadata.
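You can see this for yourself by querying the metastore database directly. This is only a sketch: it assumes a MySQL-backed metastore with the standard schema, where the TBLS and DBS tables hold the table and database entries (names can vary between Hive versions). Run it against the metastore database, not inside Hive:

-- List the Hive tables recorded in the metastore, with their type and database
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;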

Assume we have a partitioned table: the partition information is also stored in the metastore database on the name node, which lets Hive list partitions and locate data very quickly. All of this is called 'metadata'. Metadata includes information such as the table format, the mapping to its location, the data files, and so on, and it lives in the metastore rather than in the data files themselves.
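For instance, with a partitioned table each partition gets its own metastore entry, and Hive can list the partitions without reading any data files. A minimal sketch (student_by_year is a hypothetical table name):

-- Each ADD PARTITION writes a new entry into the metastore
CREATE TABLE student_by_year (id string)
PARTITIONED BY (year string);

ALTER TABLE student_by_year ADD PARTITION (year = '2016');

-- Answered from the metastore alone; no HDFS scan is needed
SHOW PARTITIONS student_by_year;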

When we drop an internal table (the default kind), Hive deletes both the data and the metadata. However, when we drop an external table, Hive deletes only the metadata, and the data remains on the Hadoop Distributed File System. Hive simply forgets about that data; it never touches the files themselves.
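You can check which kind of table you are about to drop with DESCRIBE FORMATTED; its output includes a "Table Type:" row that reads MANAGED_TABLE for internal tables and EXTERNAL_TABLE for external ones:

-- MANAGED_TABLE  -> DROP deletes metadata and data
-- EXTERNAL_TABLE -> DROP deletes metadata only
DESCRIBE FORMATTED mydatabase.sample;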

This distinction is very important when working with Hive on Hadoop. In my experience, I have seen many engineers and developers make this mistake and lose entire datasets from the data warehouse. I hope this blog helps you understand the metadata concept and the kinds of tables in Hive.

Environment

Java: JDK 1.7

Cloudera version: CDH 5.4.7; see http://www.cloudera.com/downloads/cdh/5-4-7.html

Initial steps

1. Prepare an input data file. Open vi to create a local file:

vi file1

1;Jack

2;Ryan

3;Jean


2. Put the local file onto the Hadoop Distributed File System (HDFS) with these commands:

hadoop fs -mkdir -p /data/mydata/sample

hadoop fs -put file1 /data/mydata/sample/
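Optionally, print the file back from HDFS to verify the upload:

hadoop fs -cat /data/mydata/sample/file1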


Code walkthrough and result verification

This is the Hive script that uses Hadoop and Hive to create and drop an external table and a default (internal) table:


-- Create the database if it does not exist yet
CREATE DATABASE IF NOT EXISTS mydatabase;

-- External table: DROP removes only the metadata
DROP TABLE IF EXISTS mydatabase.sample;

CREATE EXTERNAL TABLE mydatabase.sample
(
    accountId string,
    name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';

-- Internal (default) table: DROP removes metadata and data
DROP TABLE IF EXISTS mydatabase.sample;

CREATE TABLE mydatabase.sample
(
    accountId string,
    name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';
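If you save this script to a file, say sample.hql (a name chosen here just for illustration), you can run it non-interactively with the Hive CLI:

hive -f sample.hql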


1. Check that the local file was actually uploaded to the Hadoop Distributed File System:

hadoop fs -ls /data/mydata/sample/

It should list file1 under /data/mydata/sample/.


2. Open the Hive shell and run this command:


DROP TABLE IF EXISTS mydatabase.sample;

CREATE EXTERNAL TABLE mydatabase.sample
(
    accountId string,
    name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';


3. Use this command to check whether the table was created:

show create table mydatabase.sample;

-> It should show the structure of the sample table.


4. Drop the external table with this command:

drop table mydatabase.sample;


5. Repeat step 3 and observe that the table no longer exists.
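Listing the tables in the database is another quick way to confirm the drop:

show tables in mydatabase;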


6. Now check the data on HDFS to verify whether Hive deleted only the metadata or both the metadata and the data:

hadoop fs -ls /data/mydata/sample/

-> The data is still there. This confirms that dropping an external table deletes only the metadata.
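Because the data survives, you can simply re-create the external table over the same location and query it again; nothing has to be reloaded. A short sketch reusing the definition from step 2:

CREATE EXTERNAL TABLE mydatabase.sample
(
    accountId string,
    name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';

-- The rows from file1 are immediately visible again
SELECT * FROM mydatabase.sample;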


7. Now run this command to create a default (internal) Hive table:


DROP TABLE IF EXISTS mydatabase.sample;

CREATE TABLE mydatabase.sample
(
    accountId string,
    name string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
STORED AS TEXTFILE
LOCATION '/data/mydata/sample/';


8. Follow steps 3, 4, 5, and 6 again to verify how Hive handles the metadata and the actual data on the Hadoop Distributed File System:

hadoop fs -ls /data/mydata/sample/

-> This time the data is gone. This confirms that dropping an internal table deletes both the metadata and the actual data.
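As a safety net for next time: if you must drop a managed table but want to keep its files, one common approach is to flip it to external first. A sketch (note that the property value is case-sensitive on older Hive versions, so use 'TRUE'):

-- Convert the managed table to external, then drop it
ALTER TABLE mydatabase.sample SET TBLPROPERTIES ('EXTERNAL'='TRUE');
DROP TABLE mydatabase.sample;   -- now removes metadata only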


The same behavior applies when you load data, build indexes, or create views on Hive tables, whether external or internal; see the sketch below. I hope this helps you understand how Hive works with each kind of table.
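For example, LOAD DATA places a file under the table's location on HDFS, while a view is pure metadata, so dropping it never touches data. A minimal sketch (it assumes the sample table from above still exists and that file2 is a hypothetical second input file):

-- Copy a local file into the table's location on HDFS
LOAD DATA LOCAL INPATH 'file2' INTO TABLE mydatabase.sample;

-- A view is metadata only; dropping it deletes no data
CREATE VIEW mydatabase.sample_names AS
SELECT name FROM mydatabase.sample;

DROP VIEW mydatabase.sample_names;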


This article was prepared by Hadoop integration professionals to help people learn how to explore metadata in the different kinds of tables in Apache Hive. Feel free to share your thoughts on this post with other readers.









"
Comments

Articles from Ethan Millar

View blog
2 years ago · 3 min. reading time

An economic cycle contains many ups and downs, which are constantly in effect. Though navigating dur ...

2 years ago · 1 min. reading time

A few years ago, making an HTTP call from Dynamics CRM Services used to be very complex. The develop ...

3 years ago · 4 min. reading time

Java is considered to be a user-friendly language. When it comes to dealing with the data of the cus ...

You may be interested in these jobs

  • Cognizant Technology Solutions

    Sr. Associate

    Found in: beBee S2 IN - 2 hours ago


    Cognizant Technology Solutions Bangalore, India OTHER

    Delivery Manager · Qualification: · B Sc, B Com, Relevant Diploma Degrees (CSC, Electronics), BEResponsibility: · Business / Customer• Understand and articulate complex problems related to the specific technology. · • Provide business development support by assisting in RFP/ RFI ...

  • Kenvue

    Lead Engineer

    Found in: Talent500 IN C2 - 2 hours ago


    Kenvue Bengaluru, India

    S4 HANA Full Stack Developer · Kenvue GCC, Consumer Health is recruiting for an S4 HANA Full Stack developer, located in Skillman, NJ. The Digital Platform Transformation Program is a critical component of the Consumer Health strategy to become a digital first company. Consumer H ...

  • Sisco Jobs

    Robotics Engineer

    Found in: Talent IN C2 - 2 hours ago


    Sisco Jobs Secunderabad, India

    Job Description · Job Title: Robotics Engineer · Location: Remote · Employment Type: Full-time · Role Description: · We are seeking a talented and driven Robotics Engineer to join our innovative team on a full-time basis. As a Robotics Engineer, you will be responsible for design ...