Hive Table

Managed and external tables are the two types of tables in Hive, and they differ in how data is loaded, managed and controlled. In this blog we will discuss the two types of tables, the differences between them, how to create them, and when to use each for a particular dataset. The CREATE TABLE statement creates a new table and specifies its characteristics. While creating a table, you can optionally specify aspects such as:
  •  Whether the table is internal or external.
  •  The columns and associated data types.
  •  The columns used for physically partitioning the data.
  •  The file format for data files.
  •  The HDFS directory where the data files are located. 

There are Two Types of Tables in Hive

  • Managed Table (Internal Table)
  • External Table

Difference between Managed and External Table

  • When a managed table gets dropped, both the metadata and data get dropped.
  • When an external table is dropped, the data associated with it is not deleted; only the metadata (number of columns, column types, terminators, etc.) is removed from the Hive metastore.
  • A managed table is stored under the path set by the hive.metastore.warehouse.dir property, by default in a folder path similar to /user/hive/warehouse/databasename.db/tablename/.
  • A managed (internal) table is one whose files, metadata and statistics are managed by internal Hive processes.
  • An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive.
  • External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. 
  • If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.
  • Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
  • Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
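The drop behaviour described above can be sketched with a short HiveQL example. The table names and the /data/sales/ path are hypothetical, chosen only for illustration:

```sql
-- Managed table: Hive owns both the metadata and the data files.
CREATE TABLE sales_managed (id INT, amount FLOAT);

-- External table: Hive owns only the metadata; the files stay in place.
CREATE EXTERNAL TABLE sales_external (id INT, amount FLOAT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales/';

-- If partition directories are added outside Hive
-- (e.g. /data/sales/dt=2020-01-01/), refresh the metastore:
MSCK REPAIR TABLE sales_external;

-- Dropping behaves differently:
DROP TABLE sales_managed;   -- removes metadata AND deletes the warehouse files
DROP TABLE sales_external;  -- removes metadata only; files under /data/sales/ remain
```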

Hive Table Example

create table emp
(
     empno int,
     ename varchar(20),
     job varchar(20),
     sal float,
     comm float,
     deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n' STORED AS TEXTFILE;

ROW FORMAT ... ‐ fields in the data will be separated by the ‘,’ character, as they would be in a CSV file. Note that this is not the default delimiter, which is ‘\001’. If the ‘,’ character can appear within the data itself, an ESCAPED BY clause should also be specified so that embedded delimiters are not misread as field boundaries. Each row of the data file will be terminated by the newline character.

STORED AS ... ‐ these are in fact the defaults, and they equate to the Textfile format. Other formats, such as ORC, Parquet and Avro, are available (see the documentation for the full list).
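As a brief sketch (the table names here are made up for illustration), the same column definitions can be stored in a columnar format simply by changing the STORED AS clause:

```sql
-- Explicit textfile declaration (equivalent to the default):
CREATE TABLE emp_text (empno INT, ename VARCHAR(20))
STORED AS TEXTFILE;

-- Columnar formats such as ORC are usually a better fit for analytic queries:
CREATE TABLE emp_orc (empno INT, ename VARCHAR(20))
STORED AS ORC;
```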

LOCATION – This is where the data associated with the table is to be stored. It can always be specified, and you can use any HDFS location you have access to. It is more usual to use it for external tables, where some other system may be placing the data files into folders.
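A minimal sketch of the external-table pattern described above; the /data/landing/emp/ path is an assumption, standing in for wherever another process drops its files:

```sql
-- Files landed under this (hypothetical) HDFS path by an outside process
-- are visible to queries against the table as soon as they arrive.
CREATE EXTERNAL TABLE emp_ext
(
    empno  INT,
    ename  VARCHAR(20),
    sal    FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landing/emp/';
```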

TBLPROPERTIES – These are generic items which take the format of ‘key’=’value’ pairs. You can make up your own as a form of documentation. There are, however, some key values that Hive will make use of. For example, setting ‘skip.header.line.count’=’1’ tells Hive that the first line in each file contains header information (e.g. column names from a CSV file) and that it is to be ignored as not part of the data.
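A short sketch combining a Hive-recognised property with a free-form documentation key (the table name is hypothetical):

```sql
CREATE TABLE emp_csv
(
    empno INT,
    ename VARCHAR(20)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1',   -- Hive skips the CSV header row
               'comment'='employee demo data'); -- free-form documentation
```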
