When relational databases are not good enough

Often relational databases will be a good fit for your data, especially if the data are well defined in advance and the structure is not expected to change much. Everything goes fine until your data volumes becomes too large. Then it is time to think of another solution, a NoSQL solution such as HBase, Cassandra, etc. OK, but how do I know when the data volume becomes too large ? Well, here is a list of behaviors that may ring you a bell… that it is time to migrate to bigdata:
– you boost your machine: you add memory, CPU to your data server because you think it is slow (vertical scaling)
– you have problem with data replication and consistency in your database cluster (in the case you have more than one machine of course)
– you spend a lot of time improving indexes, optimizing queries, reorganizing joins, throwing out resource-intensive features such as XML processing within a stored procedure
– you denormalize the data to avoid joins, which is BTW just against the philosophy of relational databases, but you think it is OK because you know the user queries

NoSQL are based on horizontal scaling. Scalinp up in that case means adding new machines. The great thing is that common machines can do the job. You can also partition or shard your data with RDBMS but this requires much more maintenance and monitoring, whereas NoSQL solutions do the job for you. Another great feature of NoSQL databases is the schema-free feature. This means that the schema is not fixed as it would be in RDBMS. This is especially useful when you work on data that are not well defined yet, or at least data for which the structure is not defined. NoSQL databases are particularly efficient for sparse data, that means data with a lot of missing elements. In RDBMS, empty cells are created for those missing values. With NoSQL, nothing is written on disk.

Not everything is pink in the bigdata world and one of the great challenge of NoSQL designers is to define the user queries. Indeed, NoSQL database design is mainly dependent on the user queries. RDBMS are much less dependent on those user queries.

A data structure does not have to be just one system, RDBMS or NoSQL. Actually, NoSQL databases often use relational databases to store their metadata. Let’s take a concrete example to illustrate this. As a clinical hospital you may want to store your clinical data in RDBMS, while storing your genomics data that are much larger in NoSQL. Common keys will be shared by both systems. Another example. As a supermarket chain, you may have your product table in a relational database, while having transactions in NoSQL.

As a rule of thumb, if the data can be easily structured without future frequent changes and you are confident that the data volume will remain small enough (<10M rows with limited joins), then you should opt for RDBMS. On the contrary, if the data are not well structured and you expect the data volume to be large, then NoSQL is the solution.

Written by Jean-Baptiste Poullet

Data analyst – consultant – freelancer
Expert in Bigdata
Founder of RBelgium – R community in Belgium
Owner of the company Stat’Rgy
Contact me at jeanbaptistepoullet@statrgy.com

Posted in Uncategorized.

Leave a Reply

Your email address will not be published. Required fields are marked *