HBase or Impala over HBase ?

Why choose Impala over HBase instead of simply using HBase ?

You data are stored in HBase but you would like to use SQL requests and this is not possible as such with HBase, or you want to join data from a HBase table with data from a MySQL table. One solution is to use Impala over HBase.

How does it work ?

Impala does not read HFiles (from HBase) directly, it uses the HBase client API via Java Native Interface (JNI) to query data stored in HBase.

What do you need ?

You need to create a Hive table beside the HBase table and then establish a mapping from HBase to Hive. This step is actually done when you create the Hive table. The user also needs permission on the HBase table.

What to be careful of ?

For the mapping to function, the Hive column mapping to the row key should be a string value column. Other columns in HBase could be used as predicates (used in the WHERE statement) and should be string value as well to get efficient queries. If the row key in HBase is mapped to a non-string column in Hive, say an integer (int), the predicate on row key does not get transformed into scan parameter, which means that the whole table is passed to impala. This should be avoided.

Concurrent queries with Impala and HBase

Impala, since CDH 5, manages concurrent queries by default. A table of the number of nodes needed with respect to the number of queries and the data size is given on the cloudera website. For large data size (>10TB), the recommended number of nodes increases linearly with the number of concurrent queries. For instance, 100 concurrent queries requires 200 nodes on a 15TB data set, while 20 nodes are enough to deal with 10 concurrent queries on the same data set. The number of nodes also increases linearly with data size. HBase is largely more suitable for concurrent tasks. One can manage it by setting up the variable hbase.regionserver.handler.count.

Great features of Impala

Impala does not only allow you to query data using some SQL-type syntax but also allows you to query data in different structures. For instance, one can do joins on HBase and non-HBase tables. Impala makes a great use of RAM and is therefore often much faster than Hive based on the mapreduce paradigm (more CPU expensive than RAM expensive).

Inconvenients with Impala

Impala requires a refresh of the Hive tables when new data are updated/inserted.

About Phoenix

Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache takes the SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets.” They announce milliseconds for small queries and seconds for 10s of millions rows. Phoenix is not supported yet by Cloudera but can be installed on CDH 4 or 5. Note that Apache Phoenix has joined Cloudera Labs, such that Phoenix could be soon integrated by default in next CDH releases.

Written by Jean-Baptiste Poullet

Data analyst – consultant – freelancer
Expert in Bigdata
Founder of RBelgium – R community in Belgium
Owner of the company Stat’Rgy
Contact me at jeanbaptistepoullet@statrgy.com

Posted in Uncategorized.

One Comment

Leave a Reply

Your email address will not be published. Required fields are marked *