HBase, Parquet or Avro?

How do you choose between HBase, Parquet and Avro?

First, if you need to update your data, go with HBase. If only part of the data needs to be updated, consider a hybrid solution. HBase can also keep older versions of a “cell”, and its time-to-live (TTL) feature can delete old data automatically.
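
To make the versioning and TTL behaviour concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the HBase API: the class `VersionedCellStore` and its parameters `max_versions` and `ttl_seconds` are hypothetical names chosen for illustration, and real HBase configures these settings per column family.

```python
import time
from collections import defaultdict


class VersionedCellStore:
    """Toy in-memory model of versioned cells with a TTL (NOT the HBase API)."""

    def __init__(self, max_versions=3, ttl_seconds=None):
        self.max_versions = max_versions
        self.ttl_seconds = ttl_seconds
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, value, timestamp=None):
        """Write a new version of a cell instead of overwriting the old one."""
        ts = time.time() if timestamp is None else timestamp
        versions = self._cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        # Keep only the newest max_versions entries.
        del versions[self.max_versions:]

    def get(self, row, column, versions=1, now=None):
        """Return up to `versions` non-expired (timestamp, value) pairs."""
        now = time.time() if now is None else now
        live = [
            (ts, value)
            for ts, value in self._cells.get((row, column), [])
            if self.ttl_seconds is None or now - ts <= self.ttl_seconds
        ]
        return live[:versions]


store = VersionedCellStore(max_versions=2, ttl_seconds=60)
store.put("r1", "cf:temp", 20, timestamp=0)
store.put("r1", "cf:temp", 21, timestamp=30)
store.put("r1", "cf:temp", 22, timestamp=50)
recent = store.get("r1", "cf:temp", versions=2, now=55)
```

In this sketch the oldest write is dropped once a third version arrives, and a version stops being returned once it is older than the TTL.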

If you need to scan many rows (e.g., for a “COUNT DISTINCT”), HBase is not really suitable. In that case Avro will be faster, and Parquet the fastest, especially if your queries touch only some of the columns.

Let’s take an example: logs of metrics arrive every second and are fed into HBase. Suppose you are interested in analytics at the minute level. HBase should cope, because each analysis is restricted to a limited amount of data. Now suppose you want to run analyses on a daily or, worse, a monthly basis. HBase is no longer suitable, yet you still need those analyses. You can use Parquet for them, but then the queries are no longer real time. If you need real-time queries, you have to aggregate your data. You can do so at the frequency (or frequencies) defined in your KPIs, or you can partition the data and pre-aggregate it with Flume as it comes in, before loading the aggregates into HBase.
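
The pre-aggregation idea can be sketched in a few lines of plain Python. This only illustrates the bucketing logic, under the assumption that events arrive as `(epoch_seconds, value)` pairs with integer timestamps; the function name `aggregate_by_minute` is hypothetical, and the sketch does not use Flume or HBase.

```python
from collections import defaultdict


def aggregate_by_minute(events):
    """Group (epoch_seconds, value) events into per-minute count/sum/min/max."""
    buckets = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": None, "max": None}
    )
    for ts, value in events:
        # Truncate the timestamp to the start of its minute.
        minute_start = ts - (ts % 60)
        b = buckets[minute_start]
        b["count"] += 1
        b["sum"] += value
        b["min"] = value if b["min"] is None else min(b["min"], value)
        b["max"] = value if b["max"] is None else max(b["max"], value)
    return dict(buckets)


per_minute = aggregate_by_minute([(0, 1.0), (59, 3.0), (60, 5.0)])
```

Storing the count and the sum rather than the average keeps the buckets combinable, so per-minute rows can later be rolled up into hourly or daily figures without going back to the raw events.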

Avro is a row-based storage format for Hadoop.
Parquet is a column-based storage format for Hadoop.
If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
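
The difference between the two layouts can be illustrated with a toy example in plain Python. It does not read or write real Avro or Parquet files; the sample records and the helper names `to_columns`, `mean_from_rows` and `mean_from_columns` are assumptions made for illustration, and every record is assumed to have the same fields.

```python
rows = [
    {"host": "a", "cpu": 0.5, "mem": 0.7},
    {"host": "b", "cpu": 0.9, "mem": 0.2},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def mean_from_rows(rows, name):
    # Row layout: every whole record is visited, even for a single field.
    return sum(row[name] for row in rows) / len(rows)


def mean_from_columns(columns, name):
    # Column layout: only the requested column is touched.
    values = columns[name]
    return sum(values) / len(values)


columns = to_columns(rows)
cpu_mean_rows = mean_from_rows(rows, "cpu")
cpu_mean_columns = mean_from_columns(columns, "cpu")
```

Both functions return the same value. What differs is how much data has to be read: with a row layout the whole record comes along, whereas with a column layout the `host` and `mem` values are never touched.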

Written by Jean-Baptiste Poullet

Data analyst – consultant – freelancer
Expert in Bigdata
Founder of RBelgium – R community in Belgium
Owner of the company Stat’Rgy
Contact me at jeanbaptistepoullet@statrgy.com
