
Data model for Big Data


In this part we cover the following topics


Data, views, and queries

In Big Data concepts and terminology: Data and Big Data concepts and terminology: Difference between data and information we introduced data, information, and finally the DIKW pyramid. The DIKW pyramid shows that data, produced by events, can be enriched with context to create information, information can be supplied with meaning to create knowledge, and knowledge can be integrated to form wisdom, which is at the top. All of this is mainly about how we think about data and how it may change over time as we work with it. It tells us nothing about how we actually handle data and all its other forms, so now we are going to fill this gap.

Generalizing the possible approaches, there are three different stages when we work with data:

  • Pure data At this stage we have pure data that has been collected somehow and cannot be derived from anything else. In other words, there is no way to infer pure data from any source other than the original data source. Losing access to both the data and its source is an irreversible disaster, because there is no way to recover what we have collected.
  • Views Views are data (or information) that has been derived from pure data. They are used to help answer specific types of queries. Views in data processing give us the same benefits as views in (relational) database systems.

    A database view can combine data from two or more tables, using joins, or contain just a subset of information. This makes views convenient for abstracting, or hiding, complicated queries. Though a database view doesn’t store data, some refer to views as virtual tables because we can query a view just as we can query a table.

    Views can provide advantages over tables:

    • Simplicity Views can join and simplify multiple tables into a single virtual table. This way views provide a "flattened" view of the database for reporting or ad-hoc queries.
    • Preprocess data Views can act as aggregated tables, where the database engine aggregates data (sum, average, etc.) and presents the calculated results as part of the data. We can also present data in a more convenient way, for example by changing the date format.
    • Enforce business rules Views can be used to define business rules, such as when an item is active, or what is meant by “popular.” By placing complicated or frequently misunderstood business logic into the view, we can be sure to present a unified portrayal of the data. This increases both the use and the quality of the data.
    • Security Views can act as one link in the security chain. We can grant no permissions on the pure data table and at the same time create views that limit column or row access, then grant users permission to see only the views. This way views can restrict access to a table, yet allow users to access non-confidential data. For example, if we have a customer table, we might want to give all of our salespeople access to the name, address, etc. fields, but not credit_card_number. We can create a view that includes only the columns they need and then grant them access to the view.
    • Space Views take up very little space, as the data is stored once in the source table.

    So, generally speaking, database views are designed to help query data. In the same way, data processing views are designed to be computed from the data and to help answer queries about information "hidden" in pure data.

  • Queries Queries are questions we ask of our data. With queries we want to dig out or discover information "hidden" in pure data. We can do this directly on pure data, or with the help of views to make the process easier and more understandable.
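
The relationship between pure data, a view, and a query can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the customer table, its rows, and the view name customer_contact are hypothetical, and because SQLite has no user permissions, only the column restriction from the security example is shown, not the granting of access.

```python
import sqlite3

# Hypothetical pure data: a customer table that holds a confidential column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        name TEXT,
        address TEXT,
        credit_card_number TEXT
    );
    INSERT INTO customer VALUES
        (1, 'Alice', '1 Main St', '4111-0000-0000-0001'),
        (2, 'Bob', '2 Side St', '4111-0000-0000-0002');

    -- The view stores no rows of its own; it is derived from the table on each query.
    CREATE VIEW customer_contact AS
        SELECT id, name, address FROM customer;
""")

# A query asked through the view instead of the pure data table.
cursor = conn.execute("SELECT * FROM customer_contact ORDER BY id")
view_columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(view_columns)
print(rows)
```

The query never sees credit_card_number, yet the data itself is still stored only once, in the source table.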


Most wanted properties of data

Data can fulfill many different conditions and conform to different sets of rules. I'm an advocate of simplicity and I'm against complicating things if it brings no benefit. As can be read in Time series databases: Everything changes. New trend of XXI century, "...most systems work best if they are kept simple rather than made complicated; therefore simplicity should be a key goal in design and unnecessary complexity should be avoided." Because of this, I want to focus now on the most wanted and most versatile set of properties. Whatever we think, one thing is true: we don't want to lose our data; we want it to be true now and in the future; we want to be able to roll back any changes we have made to it. More technically, we want it to be raw, immutable and eternally true.


Most wanted properties of data: raw form

In everything I do in my professional life I try to be faithful to my own rule: never ever destroy original data. If I prepare an image for one of my tutorials, I always keep the original screenshot even if I know I need only a small part of it. When I need to prepare a set of artificial data to be used during my classes, I save the source code of the program used to generate it. In most cases the operations I perform on my data are not reversible. For example, when we paint a line on a bitmap and save it, we can't revert this action after reloading the image. That is why I always keep the original data.

Storing data in the rawest form possible is hugely valuable because it maximizes our ability to obtain new insights, whereas any processing such as aggregating, overwriting, or deleting limits what our data can tell us. Consider the following example.

Having raw data we can infer :::X. Having processed :::X we can make another aggregation, but there is no way to go back: raw data cannot be restored from :::Y or even :::X. Maybe it's not a strict rule, but as we can see, the rawer our data, the more answers we can get.
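
The one-way nature of aggregation can be sketched in Python. The page view events below are a hypothetical illustration chosen for this sketch, not the example the text refers to: each aggregation step drops fields, so later questions that need those fields can only be answered from the raw events.

```python
from collections import Counter

# Hypothetical raw events: one (user, day, url) tuple per page view.
raw_events = [
    ("alice", "2024-01-01", "/home"),
    ("alice", "2024-01-01", "/docs"),
    ("bob", "2024-01-01", "/home"),
    ("alice", "2024-01-02", "/home"),
]

# First aggregation: page views per day (the user and the URL are dropped).
views_per_day = Counter(day for _user, day, _url in raw_events)

# Second aggregation: total page views (the day is dropped as well).
total_views = sum(views_per_day.values())

# A question that neither aggregate can answer, but the raw events can.
views_per_user = Counter(user for user, _day, _url in raw_events)

print(dict(views_per_day), total_views, dict(views_per_user))
```

Neither views_per_day nor total_views contains enough information to rebuild raw_events or to compute views_per_user.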

Keep in mind that

  • Unstructured data is rawer than normalized data. Recall vector normalization from physics or mathematics classes. We can normalize vectors and their relative magnitudes and directions will be preserved, but this way we lose some information (the original absolute magnitudes) that we can't restore.
  • More information doesn't necessarily mean rawer data. Sometimes additional information serves only as a container for the contents and shouldn’t be part of our raw data. :::example needed
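
The vector normalization analogy can be checked with a few lines of Python. This is a minimal sketch with two hypothetical 2D vectors: they point in the same direction but have different lengths, and after normalization they become indistinguishable, so the original magnitudes cannot be recovered.

```python
import math

def normalize(vector):
    """Scale a 2D vector to unit length; the original magnitude is discarded."""
    length = math.hypot(vector[0], vector[1])
    return (vector[0] / length, vector[1] / length)

# Two hypothetical vectors with the same direction but different magnitudes.
short_vector = (3.0, 4.0)   # length 5
long_vector = (6.0, 8.0)    # length 10

unit_from_short = normalize(short_vector)
unit_from_long = normalize(long_vector)

print(unit_from_short, unit_from_long)
```

Both calls return the same unit vector, so nothing in the normalized result tells us whether we started from length 5 or length 10.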


Most wanted properties of data: immutable form

Data immutability means that we don’t update or delete data; we only add more. By using an immutable schema for Big Data systems, we make the system fault tolerant. Making our system, and our data in particular, resistant to faults is an essential property. Faults caused by humans can be especially destructive. People make mistakes, and we must limit the impact of such mistakes and have mechanisms for recovering from them. With a mutable data model, a mistake can cause data to be lost, because values are actually overwritten in the database. With an immutable data model, no data can be lost. If bad data is written, earlier (good) data units still exist. Fixing the data system is just a matter of deleting the bad data units and recomputing the views built from the raw dataset.

One of the trade-offs of the immutable approach is that it uses more storage than a mutable schema. Rather than storing a current snapshot of the world, as the mutable schema does, we create a separate record every time a piece of data or information changes. We track each data field, so the entire history of data changes is stored rather than just the current view of the world. Because multiple instances of every data field then exist, we have to tie each of them to the moment in time when the information was known to be true.
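
The difference between the two schemas, and the recovery procedure described above, can be sketched as follows. Everything here is a hypothetical illustration: the record layout, the city values, and the integer timestamps are assumptions made for this sketch only.

```python
# Mutable schema: one current value per key, so a bad write overwrites good data.
mutable_store = {"alice": "Krakow"}
mutable_store["alice"] = "Wasraw"  # human typo; the good value "Krakow" is gone

# Immutable schema: an append-only log of timestamped records.
log = [
    {"entity": "alice", "city": "Warsaw", "timestamp": 100},
    {"entity": "alice", "city": "Krakow", "timestamp": 200},
    {"entity": "alice", "city": "Wasraw", "timestamp": 300},  # the same bad write
]

def current_city_view(records):
    """Recompute the 'current city' view: the latest record per entity wins."""
    view = {}
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        view[rec["entity"]] = rec["city"]
    return view

view_with_error = current_city_view(log)

# Recovery: drop the bad data unit and recompute the view from what remains.
cleaned_log = [rec for rec in log if rec["city"] != "Wasraw"]
view_after_fix = current_city_view(cleaned_log)

print(mutable_store, view_with_error, view_after_fix)
```

In the mutable store the previous value cannot be restored, while the log still holds the earlier good records and pays for that with one stored record per change.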


Most wanted properties of data: perpetual form

An immutable raw dataset is sometimes called a master dataset. Each piece of data coming from this set is true in perpetuity because it is tagged with a timestamp.



There are many ways we could choose to represent the master dataset. We can choose between traditional relational tables, structured XML, semistructured JSON documents, or any other option we know. It depends on your needs and knowledge. In this part we want to describe another possibility, known as the fact-based model.

Recalling the example from Most wanted properties of data: immutable form we can express it in the following form

The most important fact-based model properties are

  • timestamping
  • atomicity
  • identifiability

Advantages

  • Is queryable at any time in its history
  • Tolerates human errors
  • Handles partial information. Storing one fact per record makes it easy to handle partial information about an entity without introducing NULL values into your dataset.
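
The three properties and the listed advantages can be sketched in Python. This is a minimal illustration under stated assumptions, not a definitive implementation of the fact-based model: the Fact class, its field names, the state_at helper, and the sample facts are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a fact cannot be modified after creation
class Fact:
    entity_id: int   # identifiability: which entity the fact is about
    attribute: str   # atomicity: exactly one attribute per fact
    value: str
    timestamp: int   # timestamping: when the fact became true (placeholder units)

facts = [
    Fact(1, "name", "Alice", 100),
    Fact(1, "city", "Warsaw", 100),
    Fact(1, "city", "Krakow", 200),
    Fact(2, "name", "Bob", 150),   # no city is known for entity 2
]

def state_at(all_facts, entity_id, as_of):
    """Rebuild an entity's state as it was known at time `as_of`."""
    state = {}
    for fact in sorted(all_facts, key=lambda f: f.timestamp):
        if fact.entity_id == entity_id and fact.timestamp <= as_of:
            state[fact.attribute] = fact.value
    return state

alice_then = state_at(facts, 1, as_of=150)
alice_now = state_at(facts, 1, as_of=250)
bob_now = state_at(facts, 2, as_of=250)

print(alice_then, alice_now, bob_now)
```

The same list of facts answers questions about any moment in its history, and the missing city for entity 2 simply results in no fact at all rather than a NULL value.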