Hive

Apache Hive is a data warehousing infrastructure built on the Hadoop framework and is well suited to data summarization, data analysis, and data querying. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the MapReduce programming paradigm) on commodity hardware. Hive is designed to make summarization, ad-hoc querying, and analysis of large volumes of data easy. It provides a simple query language called HiveQL, which is based on SQL and lets users familiar with SQL query the data directly. At the same time, HiveQL allows traditional MapReduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that the built-in capabilities of the language may not support. In effect, Hive is a layer on top of the MapReduce programming model that makes querying and analysis easier.

The Hive engine compiles HiveQL queries into MapReduce jobs that are executed on Hadoop. In addition, custom MapReduce scripts can be plugged into queries. Hive operates on data stored in tables, whose columns hold primitive data types and collection data types such as arrays and maps.
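To make the idea of plugging a custom script into a query more concrete, here is a minimal sketch of a mapper written in Python. It assumes the script is invoked through HiveQL's TRANSFORM ... USING clause, for example SELECT TRANSFORM(user_id, url) USING 'python extract_domain.py' AS (user_id, domain) FROM page_views; the table, column, and script names are hypothetical. Hive passes rows to such a script as tab-separated lines on standard input and reads tab-separated lines back from standard output.

```python
import io
import sys  # needed when the script reads real input from Hive


def map_rows(lines, out):
    """Read tab-separated (user_id, url) rows and emit (user_id, domain)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the job
        user_id, url = fields
        # Strip the scheme, then keep everything before the first "/".
        domain = url.split("://", 1)[-1].split("/", 1)[0]
        out.write(f"{user_id}\t{domain}\n")


# Local demonstration with in-memory data. A deployed script would
# instead call map_rows(sys.stdin, sys.stdout).
sample = io.StringIO(
    "u1\thttp://example.com/a\nu2\thttps://example.org/b/c\nbad row\n"
)
result = io.StringIO()
map_rows(sample, result)
print(result.getvalue(), end="")
```

Keeping the row logic in a function that takes any iterable of lines makes the script easy to try locally before it is shipped to the cluster.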
Apache Hive
  • Data Summarization
  • Data Analysis
  • Data Querying
Hive has become very popular because its tables resemble those of relational databases. If you know how to work with SQL, working with Hive should feel familiar. Many users worldwide query data with HiveQL (HQL). Hive lets you impose structure on largely unstructured data. Once you have defined the structure, you can use Hive to query the data without any knowledge of Java or MapReduce.
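The idea of imposing structure on raw data at query time can be illustrated with a small Python sketch. This is not Hive code; it only mimics the behaviour under simplifying assumptions: the schema, the column names, and the comma delimiter are all hypothetical, and values that are missing or cannot be parsed are turned into None, in the way Hive's text tables return NULL for them.

```python
from typing import Callable, List, Tuple

# Hypothetical schema: column name paired with a converter for its type.
Schema = List[Tuple[str, Callable[[str], object]]]

page_view_schema: Schema = [("user_id", str), ("views", int), ("country", str)]


def read_with_schema(raw_lines, schema, delimiter=","):
    """Apply the schema when the data is read; the raw lines stay untouched."""
    rows = []
    for line in raw_lines:
        fields = line.strip().split(delimiter)
        row = {}
        for i, (name, convert) in enumerate(schema):
            try:
                row[name] = convert(fields[i])
            except (IndexError, ValueError):
                row[name] = None  # missing or unparsable value, like NULL
        rows.append(row)
    return rows


raw = ["u1,3,IN", "u2,abc,US", "u3,7"]
rows = read_with_schema(raw, page_view_schema)

# Rough equivalent of SELECT SUM(views) over the structured rows.
total_views = sum(r["views"] for r in rows if r["views"] is not None)
print(rows)
print(total_views)
```

The raw lines are never rewritten; only the interpretation applied while reading them changes, which is why the same files can be given a different structure later.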
Characteristics of Hive
  • Apache Hive uses distributed storage.
  • Hive is a data warehouse designed for managing and querying only structured data that is stored in tables.
  • Hive provides tools for easy data extract/transform/load (ETL).
  • It supports different file formats.
  • With Hive, we can query and manage large datasets residing in files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.
Why use Apache Hive?
Apache Hive is mainly used for data querying, analysis, and summarization. It improves developer productivity, although usually at the cost of higher latency. HiveQL is a variant of SQL and compares well with the SQL dialects implemented in databases. Hive offers many built-in functions and supports user-defined functions (UDFs), which offer effective ways of solving problems. Hive queries can readily be connected to various Hadoop packages such as RHive, RHipe, and Apache Mahout. It also helps the developer community work with complex analytical processing and challenging data formats.

A data warehouse is a system used for reporting and data analysis. Data analysis means inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information and suggesting conclusions. It has multiple aspects and approaches, encompassing diverse techniques under a variety of names in different domains.

Hive allows many users to access data simultaneously. Response time, i.e., the time a system or a functional unit takes to react to a given input, is typically higher than in a traditional database for small queries, because every query pays the start-up cost of a batch job. On very large datasets, however, Hive's parallel execution can answer queries that would be impractical on a single machine. Hive is also highly scalable: more commodity machines can easily be added to the cluster as the data grows, without a significant drop in performance.
File formats supported by Hive are:
  • Flat files or text files
  • Sequence files consisting of binary key–value pairs
  • RCFiles (Record Columnar Files), which store the columns of a table in a columnar layout
  • Parquet files
  • ORC files
  • Avro files
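The difference between a row-oriented layout (as in plain text files) and a columnar layout (as in RCFile, ORC, or Parquet) can be sketched in a few lines of Python. This is only a conceptual illustration with hypothetical data; the real formats add compression, indexes, and row groups that are not modelled here.

```python
# Three hypothetical rows of a table with columns (user_id, country, views).
rows = [("u1", "IN", 3), ("u2", "US", 5), ("u3", "IN", 7)]

# Row-oriented layout: each record's fields are stored together.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: each column's values are stored together.
column_layout = {
    "user_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "views": [r[2] for r in rows],
}

# Summing one column: the row layout touches every stored value,
# while the columnar layout touches only the "views" column.
values_scanned_row = len(row_layout)
values_scanned_col = len(column_layout["views"])
total = sum(column_layout["views"])
print(values_scanned_row, values_scanned_col, total)
```

For analytical queries that read a few columns out of many, this is the main reason columnar formats tend to be preferred.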
Limitations of Hive
  • Hive is not designed for online transaction processing (OLTP); it is intended for online analytical processing (OLAP).
  • Hive supports overwriting or appending data, but row-level updates and deletes are not available on ordinary (non-transactional) tables.
  • It does not support real-time queries.
  • It provides limited query support: HiveQL covers only a subset of standard SQL.
Modes of Hive
Hive operates in two major modes, described below. The choice of mode depends on the number of data nodes in the Hadoop installation and on the size of the data.
  • Local mode – used when Hadoop is installed in pseudo-distributed mode with only one data node, when the data is small enough to fit on a single local machine, and when processing will be faster on the smaller datasets held on that machine.
  • MapReduce mode – used when Hadoop has multiple data nodes and the data is distributed across them. It operates on huge datasets, executes queries in parallel, and achieves better performance when processing large datasets.
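
As a rough illustration of how a query runs when data is divided across nodes, the following Python sketch simulates the map, shuffle, and reduce phases for a query along the lines of SELECT country, COUNT(*) FROM page_views GROUP BY country. The table name and data are hypothetical, and the phases run sequentially in one process here, whereas on a cluster each split would be handled by a mapper on its own node.

```python
from collections import defaultdict

# Hypothetical data set, already divided across three data nodes.
splits = [
    ["IN", "US", "IN"],
    ["US", "DE"],
    ["IN", "DE", "DE"],
]


def map_phase(split):
    """Emit a (key, 1) pair per record, as a mapper would on its own node."""
    return [(country, 1) for country in split]


def shuffle(mapped_outputs):
    """Group values by key so that each key reaches exactly one reducer."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle(map_phase(s) for s in splits))
print(counts)
```

Because each split is mapped independently, adding nodes lets more splits be processed at the same time, which is where the benefit on large datasets comes from.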
