Issue 20

New business development analysis

Ioana Armean
Business Analyst
@Imprezzio Global



MANAGEMENT

When someone says "data modeling", everyone automatically thinks of relational databases, of the process of normalizing the data, of the third normal form and so on; and that is a good instinct - it means the semesters spent studying databases paid off and shaped the way you think about and work with data. However, things have changed since college: we do not hear so much about relational databases anymore, even though they are still the predominant choice in applications. Nowadays "big data" is the trend, and more and more applications need to handle the volume, the velocity, the variety and the complexity of the data (according to Gartner's definition).

In this article I am going to look at the dualism of normalization and denormalization in the big data context, drawing on my experience with MarkLogic (a platform for big data applications).

About normalization

Data normalization is part of the data modeling process for creating an application. Most of the time, normalization is a good practice, for at least two reasons: it frees your data from integrity issues on alteration tasks (inserts, updates, deletes), and it avoids bias towards any particular query model. The article "Denormalizing Your Way to Speed and Profit" draws a very interesting comparison between data modeling and philosophy: Descartes's principle of mind and body separation - widely accepted, at least initially - looks an awful lot like the normalization process, the separation of data; Descartes's error was to separate (philosophically) two parts which had always been together. In the same way, after normalization the data needs to be brought back together for the sake of the application: the data, which was initially together, has been fragmented and now must be recoupled once again. It seems rather redundant, but it has been the most common approach of the last decades, especially when working with relational databases. Moreover, even the lexicon and the etymology sustain this practice - the fragmented data is considered "normal".
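To make the "separate, then recouple" cycle concrete, here is a minimal sketch using an in-memory SQLite database; the table and column names are purely illustrative, not taken from any particular application.

```python
# Normalized model: author data lives apart from the articles that reference it,
# so serving the application requires joining the fragments back together.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT,
                      author_id INTEGER REFERENCES author(id));
""")
cur.execute("INSERT INTO author  VALUES (1, 'Ioana')")
cur.execute("INSERT INTO article VALUES (10, 'Data modeling', 1)")

# Recoupling step: the join rebuilds the logical entity the application needs.
cur.execute("""
SELECT article.title, author.name
FROM article JOIN author ON article.author_id = author.id
""")
print(cur.fetchall())   # [('Data modeling', 'Ioana')]
```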

About denormalization

When it comes to data modeling in the big data context (especially with MarkLogic), there is no universally recognized form into which you must fit the data; on the contrary, the schema concept no longer applies. However, the support offered by big data platforms for unstructured data must not be confused with a lack of need for data modeling. In this context the raw data must be analyzed from a different point of view, namely from the point of view of the application's needs, which makes the resulting database application-oriented. If we consider the most frequent operation - read - it may be said that any application is a search application; this is why the modeling process needs to consider the entities which are logically handled (upon which the search is made), such as articles, user information, car specifications, etc.
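As a hedged sketch of what such an application-oriented, denormalized entity might look like, consider the document below; the field names are illustrative and do not represent a MarkLogic schema, only the idea that the document mirrors the logical entity the search works on.

```python
# A denormalized "article" document: the related data travels inline,
# so a search returns the whole logical entity in a single read, no join.
article_doc = {
    "id": 10,
    "title": "Data modeling",
    "author": {"id": 1, "name": "Ioana", "role": "Business Analyst"},
    "tags": ["big data", "MarkLogic"],
}

def matches(doc, term):
    """Toy search predicate over the self-contained document."""
    t = term.lower()
    return t in doc["title"].lower() or t in (tag.lower() for tag in doc["tags"])

print(matches(article_doc, "MarkLogic"))   # True
```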

While normalization breaks the raw data apart for the sake of protocol, without considering the functional needs, denormalization is done only to serve the application - with care, of course, since excessive denormalization can cause more damage than it solves. The order of steps when developing an application on normalized data seems to follow the waterfall methodology: once the model is established, work starts on the query models, and no matter what performance is obtained, adjustments are made to the queries or to the database indexes, but never to the model. With a denormalized database, the relationship between the data model and the query models better resembles the agile methodology: if the functional and non-functional requirements are not met, changes are made to the data as well in order to improve query performance, until the required result is obtained.

All the arguments which made normalization so famous still stand, but big data platforms have developed tools to keep data integrity and to overcome the other problems. Big data systems are easier to scale for high volumes of data (both horizontally and vertically), which makes the problem of the extra volume generated by denormalization simply go away; moreover, the extra volume helps improve the overall performance of searches. The solution to the integrity problem depends on the chosen architecture, but also on which system is the master of the data.

Solving integrity issues on denormalization

When data denormalization is chosen, the chosen solution is clearly an application-oriented data store, but this represents only the data source with which the application communicates directly, not necessarily the original source of the data (the master of the data). For big data systems there are two options: either the data lives only in the big data database, or the data has a relational database as its original source and reaches the big data "warehouse" through an extract-transform-load (ETL) tool. Given these two options, the possible integrity issues are handled accordingly.

If the data exists only in the big data system, an instrument is required for synchronizing and integrating the data which was altered. Tools that implement map-reduce are the most often used, as they have proven efficient and they run on commodity hardware. Such sync processes can be triggered as soon as the original change is applied - when changes are infrequent and there is no risk of generating a deadlock; when changes are more frequent, it is recommended to use a job running on an established schedule.
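The scheduled-sync option could look roughly like the sketch below; the in-memory "stores" and the upsert call are stand-ins I introduce for illustration, not a real MarkLogic or map-reduce API.

```python
# A minimal sketch of a scheduled sync job: every run propagates documents
# changed since the previous run into the denormalized store (idempotent upsert).
import time

changed_docs = {}       # doc_id -> (change_timestamp, doc), filled by the application
big_data_store = {}     # stand-in for the denormalized document store

def sync_job(last_run: float) -> float:
    """Push all documents changed after last_run; return the new checkpoint."""
    now = time.time()
    for doc_id, (ts, doc) in changed_docs.items():
        if ts > last_run:
            big_data_store[doc_id] = doc
    return now

# Triggered variant: call sync_job right after an infrequent change.
# Scheduled variant: run it on a fixed interval (cron, scheduler, etc.).
changed_docs["article:10"] = (time.time(), {"title": "Data modeling"})
checkpoint = sync_job(last_run=0.0)
print(big_data_store)
```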

When the original data is located in a relational database, the effort of maintaining data integrity is carried by the original storage system - which is expected to be normalized. In such a situation, you need to invest a lot in the ETL tool in order to restore the logical structure of the data. Even if the freedom offered by this tool is large, applications need to respect a certain standard of performance and reliability, so new changes must reach the big data system as soon as possible; therefore the risk of excessive denormalization exists, since doing the heavy lifting in the ETL greatly reduces the computational effort on the big data platform.
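A hedged sketch of that ETL step follows: rows from the normalized master database are recoupled once, on the way out, into the denormalized documents the big data system will serve. The table and field names are illustrative only.

```python
# Extract: rows as they would come from the normalized relational source.
authors  = {1: {"name": "Ioana"}}
articles = [{"id": 10, "title": "Data modeling", "author_id": 1}]

def transform(rows, authors):
    """Transform: recouple each article row with its author data, inline."""
    for row in rows:
        yield {
            "id": row["id"],
            "title": row["title"],
            "author": authors[row["author_id"]],   # denormalized copy
        }

def load(docs, store):
    """Load: write the documents into the application-oriented store."""
    for doc in docs:
        store[doc["id"]] = doc

store = {}
load(transform(articles, authors), store)
print(store[10])
```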

Denormalization and joins

After all the evangelizing above in favor of denormalization, it may seem senseless to touch the subject of "joins"; denormalization is, after all, a solution for avoiding large-scale joins - we are in a big data context. However, quality attributes, multiple data sources and external protocol compliance can radically reduce the options for modeling and denormalizing. Let's take a concrete example: the business model of periodic entitlements to the columns of a newspaper; let's also add the dimensions the model has to handle: 45 million articles and 9 billion column-user relations. Each user can purchase entitlements to certain newspapers on a time basis (only a few editions); therefore the join conditions are derived from matching the identifier of the newspaper with the one in the entitlement, and from the entitlement period encapsulating the date of the article. Why is denormalization unsuitable for this scenario? The column model would need to contain denormalized information about all the users who are allowed to access it - which would pollute the column entity, but would also add extra computational effort on the ETL or map-reduce side, degrading the value of the application; moreover, a change to the entitlement period of a single user can alter millions of columns, triggering a process of reconstructing the consistency of the entitlements… eventually.
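A minimal sketch of the alternative - keeping this many-to-many relation separate and evaluating the small join-like check at query time - is shown below. All names, identifiers and the date handling are hypothetical, introduced only to illustrate the entitlement conditions described above.

```python
# Entitlements stay separate from the column/article documents; access is
# decided by matching the newspaper id and checking that the entitlement
# period encapsulates the article's date.
from datetime import date

article = {"id": 10, "newspaper_id": "daily-gazette", "published": date(2014, 3, 1)}

entitlements = [
    # one entry per user-newspaper purchase, with its validity window
    {"user_id": 7, "newspaper_id": "daily-gazette",
     "valid_from": date(2014, 1, 1), "valid_to": date(2014, 6, 30)},
]

def can_read(user_id, article, entitlements):
    return any(
        e["user_id"] == user_id
        and e["newspaper_id"] == article["newspaper_id"]
        and e["valid_from"] <= article["published"] <= e["valid_to"]
        for e in entitlements
    )

print(can_read(7, article, entitlements))   # True
```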

Conclusion

In the big data context, the best option for data modeling is denormalization - modern applications need high responsiveness, and it is not worth wasting (execution) time putting the normalized data back together just to offer the user the logical entities. Of course, complete denormalization is not the best option for encapsulating a large many-to-many relationship, as I have shown in the previous paragraph. To finish on a funny note, according to the title: "normalization is for sissies", and denormalization is the solution.

