When someone says "data modeling", everyone automatically thinks of relational databases, of the process of normalizing the data, of the third normal form and so on; and that is a good practice, as it also means that the semesters spent studying databases paid off and shaped the way you think about and work with data. However, since college things have changed: we no longer hear so much about relational databases, although they still predominate in applications. Nowadays "big data" is the trend, but it is also a real situation: more and more applications need to handle the volume, the velocity, the variety, and the complexity of the data (according to Gartner's definition).
In this article I am going to approach the duality of the concepts of normalization and denormalization in the big data context, drawing on my experience with MarkLogic (a platform for big data applications).
Data normalization is part of the data modeling process when creating an application. Most of the time normalization is good practice, for at least two reasons: it frees your data of integrity issues on alteration tasks (inserts, updates, deletes), and it avoids bias towards any particular query model. The article "Denormalizing Your Way to Speed and Profit" draws a very interesting comparison between data modeling and philosophy: Descartes's principle of mind and body separation, widely accepted at first, looks a lot like the normalization process, the separation of data; Descartes's error was to separate (philosophically) two parts which had always been together. In the same way, after normalization the data needs to be brought back together for the sake of the application; the data, which was initially together, has been fragmented and now must be recoupled once again. It seems rather redundant, but it has been the most used approach of the last decades, especially when working with relational databases. Moreover, even the lexicon and the etymology sustain this practice: the fragmented data is considered "normal".
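As a minimal sketch of this recoupling (the entities and field names are invented for illustration, not taken from any real schema), here is what reading normalized data typically involves: the article, its author, and its newspaper live in separate structures, and every read has to stitch them back together.

```python
# Hypothetical normalized "tables" kept as separate lists of records.
newspapers = [{"newspaper_id": 1, "title": "The Daily Example"}]
authors = [{"author_id": 7, "name": "Jane Doe"}]
articles = [{"article_id": 42, "newspaper_id": 1, "author_id": 7,
             "headline": "Normalization vs. denormalization"}]

def read_article(article_id):
    """Re-join the fragments so the application gets back one logical entity."""
    article = next(a for a in articles if a["article_id"] == article_id)
    newspaper = next(n for n in newspapers
                     if n["newspaper_id"] == article["newspaper_id"])
    author = next(u for u in authors if u["author_id"] == article["author_id"])
    return {**article, "newspaper": newspaper["title"], "author": author["name"]}

print(read_article(42))
```

The join itself is trivial here; the point is that it happens on every read, even though the data started out as one logical whole.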
When it comes to data modeling in the big data context (especially MarkLogic), there is no universally recognized form into which you must fit the data; on the contrary, the schema concept no longer applies. However, the support offered by big data platforms for unstructured data must not be confused with a lack of need for data modeling. In this context the raw data must be analyzed from a different point of view, more precisely from the point of view of the application's needs, which makes the resulting database application-oriented. If we look at the most frequent operation, the read, it may be said that any application is a search application; this is why the modeling process needs to consider the entities which are logically handled (upon which the search is made), such as: articles, user information, car specifications, etc.
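As a hedged illustration (the fields are invented; in MarkLogic such an entity would typically be an XML or JSON document), a search-oriented model stores each logical entity as one self-contained document rather than as rows scattered across tables:

```python
# One self-contained, denormalized document per logical entity.
article_document = {
    "article_id": 42,
    "headline": "Normalization vs. denormalization",
    "published": "2014-06-15",
    "newspaper": {"newspaper_id": 1, "title": "The Daily Example"},
    "author": {"author_id": 7, "name": "Jane Doe"},
    "keywords": ["data modeling", "big data", "denormalization"],
}
# A search returns the whole entity in one read, with no join step at query time.
```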
While normalization breaks up the raw data for the sake of convention, without considering the functional needs, denormalization is done only to serve the application; of course, it must be done with care, as excessive denormalization can cause more damage than it solves. The order of the steps in which an application using normalized data is developed seems to follow the waterfall methodology: once the model is established, work starts on the query models and, no matter the obtained performance, adjustments are made to the queries or to the database indexes, but never to the model. With a denormalized database, the relationship between the data model and the query models better resembles the agile methodology: if the functional and non-functional requirements are not met, then changes are made to the data as well in order to improve query performance, until the required result is obtained.
All the arguments which made normalization so famous still stand, but big data platforms have developed tools to keep data integrity and to overcome the other problems. Big data systems are easier to scale for high volumes of data (both horizontally and vertically), which makes the problem of the extra volume generated by denormalization simply go away; moreover, the extra volume helps improve the overall performance of searches. The solution to the integrity problem depends on the chosen architecture, but also on which system is the master of the data.
When data denormalization is chosen, it is clear that the chosen solution is an application-oriented data store, but this represents only the data source with which the application communicates directly, not necessarily the original source of the data (the master of the data). For big data systems there are two options: either the data lives only in the big data database, or it has a relational database as its original source and, through an extract-transform-load (ETL) tool, reaches the big data "warehouse". Given these two options, the possible integrity issues are handled accordingly.
If the data exists only in the big data system, an instrument is required for synchronizing and integrating the data which was altered. Map-reduce tools are the ones most often used for this, as they have proved efficient and they run on commodity hardware. Such a sync process can be triggered as soon as the original change is applied, when changes are not too frequent and there is no possibility of generating a dead-lock; when the changes are more frequent, it is recommended to use a job running on an established schedule.
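As a rough sketch of the scheduled variant (the helper functions are assumptions standing in for the real database client, not a MarkLogic or Hadoop API), a periodic job can pick up the documents altered since the last run and propagate their data into the denormalized copies:

```python
import time
from datetime import datetime, timezone

def fetch_changed_documents(since):
    """Placeholder: would query the big data store for documents
    modified after `since`; returns an empty list in this sketch."""
    return []

def propagate(document):
    """Placeholder: would rewrite every denormalized copy that
    embeds data from `document`."""
    print("propagating", document)

def sync_job(interval_seconds=300, iterations=1):
    """Periodically reconcile the denormalized copies of changed documents."""
    last_run = datetime.now(timezone.utc)
    for _ in range(iterations):          # a real job would loop indefinitely
        time.sleep(interval_seconds)
        now = datetime.now(timezone.utc)
        for doc in fetch_changed_documents(since=last_run):
            propagate(doc)
        last_run = now
```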
When the original data is located in a relational database, the effort of maintaining data integrity is sustained by the original storage system, which is expected to be normalized. In such a situation you need to invest a lot in the ETL tool to restore the logical structure of the data. Even if the freedom offered by this tool is large, applications need to respect a certain standard of performance and reliability, so new changes must reach the big data system as soon as possible; therefore the risk of excessive denormalization exists, since pushing the work into the ETL greatly reduces the computational effort on the big data platform.
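A minimal sketch of such an ETL transformation step, assuming a relational source with tables like the ones used earlier (all table and column names are invented), could look like this: the normalized rows are joined once, on the ETL side, and emitted as one self-contained document per article, ready to load into the big data store.

```python
import sqlite3, json

def extract_transform(conn):
    """Join the normalized rows once, on the ETL side, and emit one
    denormalized document per article."""
    rows = conn.execute("""
        SELECT a.article_id, a.headline, n.title AS newspaper, u.name AS author
        FROM articles a
        JOIN newspapers n ON n.newspaper_id = a.newspaper_id
        JOIN authors    u ON u.author_id    = a.author_id
    """)
    for article_id, headline, newspaper, author in rows:
        yield json.dumps({"article_id": article_id, "headline": headline,
                          "newspaper": newspaper, "author": author})

# Tiny in-memory source database, just to make the sketch executable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE newspapers (newspaper_id INTEGER, title TEXT);
    CREATE TABLE authors (author_id INTEGER, name TEXT);
    CREATE TABLE articles (article_id INTEGER, newspaper_id INTEGER,
                           author_id INTEGER, headline TEXT);
    INSERT INTO newspapers VALUES (1, 'The Daily Example');
    INSERT INTO authors VALUES (7, 'Jane Doe');
    INSERT INTO articles VALUES (42, 1, 7, 'Normalization vs. denormalization');
""")
for document in extract_transform(conn):
    print(document)
```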
After all the evangelizing above in favor of denormalization, it may seem senseless to touch the subject of "joins"; denormalization is precisely a way of avoiding large scale joins, and we are in a big data context after all. However, quality attributes, multiple data sources and compliance with external protocols can radically reduce the options for modeling/denormalizing. Let's take a concrete example: the business model of periodic entitlements to the columns of a newspaper; let's also add the dimensions the model has to handle: 45 million articles and 9 billion column-user relations. Each user can purchase entitlements to certain newspapers on a time basis (only a few editions); therefore the join conditions are derived from the match between the identifier of the newspaper and the one in the entitlement, and from the entitlement period encapsulating the date of the article. Why is denormalization unsuitable for this scenario? The model for the column would need to contain denormalized information about all the users who are allowed to access it; this would pollute the column entity, but would also mean extra computational effort on the ETL or map-reduce side, and this would degrade the value of the application. Moreover, a change to the entitlement period of a single user can alter millions of columns, which would trigger a process of rebuilding the consistency of the entitlements… eventually.
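To make that join condition concrete, here is a hedged sketch (the field names are assumptions derived from the description above): a user may read a column when one of his entitlements references the column's newspaper and its period covers the column's publication date; the check is evaluated at query time instead of being denormalized into every column.

```python
from datetime import date

def can_read(column, entitlements):
    """Join condition kept at query time: the entitlement must reference the
    column's newspaper and its period must cover the column's publication date."""
    return any(
        e["newspaper_id"] == column["newspaper_id"]
        and e["valid_from"] <= column["published"] <= e["valid_to"]
        for e in entitlements
    )

column = {"newspaper_id": 1, "published": date(2014, 6, 15)}
user_entitlements = [
    {"newspaper_id": 1, "valid_from": date(2014, 6, 1), "valid_to": date(2014, 6, 30)},
]
print(can_read(column, user_entitlements))  # True
```

Keeping this check as a query-time condition touches only the user's handful of entitlements, whereas denormalizing it would mean rewriting millions of column documents whenever an entitlement period changes.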
In the big data context, the best option for data modeling is denormalization: modern applications need high responsiveness, and it is not worth wasting (execution) time putting normalized data back together in order to offer the user the logical entities. Of course, complete denormalization is not the best option for encapsulating a big many-to-many relationship, as I have shown in the previous paragraph. To finish on a funny note, in keeping with the title of the article: "normalization is for sissies", and denormalization is the solution.