In this part we cover the following topics: the stages we go through when working with data, the most wanted properties of data, and the fact-based model.
In Big Data concepts and terminology: Data and Big Data concepts and terminology: Difference between data and information we introduced data, information, and finally the idea of the DIKW pyramid. The DIKW pyramid shows that data, produced by events, can be enriched with context to create information; information can be supplied with meaning to create knowledge; and knowledge can be integrated to form wisdom, which sits at the top. All of this is mainly about how we think about data and how it may change over time as we work with it. What we haven't covered yet is how we actually deal with data and all its other forms, so now we are going to fill this knowledge gap.
Generalizing possible approaches, there are three different stages when we work with data:
- Pure data At this stage we have somehow collected pure data, data that cannot be derived from anything else. In other words, there is no means to infer pure data from any source other than the original data source. Losing access to the data and the data source is an irreversible disaster, because there is no way to recover what we have collected.
- Views Views are data (or information) that has been derived from pure data. They are used to assist with answering specific types of queries. Views in data processing give us the same thing as views in (relational) database systems.
A database view can combine data from two or more tables, using joins, and can also contain just a subset of information. This makes views convenient for abstracting, or hiding, complicated queries. Though a database view doesn't store data, some refer to views as virtual tables, because we can query a view the same way we can query a table.
Views can provide advantages over tables:
- Simplicity Views can join and simplify multiple tables into a single virtual table. This way views provide a "flattened" view of the database for reporting or ad-hoc queries.
- Preprocess data Views can act as aggregated tables, where the database engine aggregates data (sum, average, etc.) and presents the calculated results as part of the data. We can also present data in a more convenient way, for example change the date format.
- Enforce business rules Views can be used to define business rules, such as when an item is active, or what is meant by "popular." By placing complicated or misunderstood business logic into the view, we can be sure to present a unified portrayal of the data. This increases use and quality.
- Security Views can act as one link in the security chain. We can grant no permissions on the pure data table and at the same time create views that limit column or row access, then grant users permissions to see only the view. This way views can restrict access to a table, yet allow users to access non-confidential data. For example, if we have a customer table, we might want to give all of our sales people access to the name, address, etc. fields, but not credit_card_number. We can create a view that only includes the columns they need and then grant them access on the view (see the sketch after this list).
- Space Views take up very little space, as the data is stored once in the source table.
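To make the security example concrete, here is a minimal sketch using Python's standard-library sqlite3 module; the customer table, its columns, and the customer_public view are made-up names for this illustration.

```python
import sqlite3

# In-memory database; the schema and data below are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        name TEXT,
        address TEXT,
        credit_card_number TEXT
    );
    INSERT INTO customer VALUES (1, 'Alice', 'Oak Street 1', '4111...');

    -- the view exposes only the non-confidential columns
    CREATE VIEW customer_public AS
        SELECT id, name, address FROM customer;
""")

# A view is queried exactly like a table; credit_card_number stays hidden.
for row in conn.execute("SELECT * FROM customer_public"):
    print(row)   # (1, 'Alice', 'Oak Street 1')
```

Granting users permissions only on customer_public (in a database server that supports grants) would complete the security chain described above.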
So, generally speaking, database views are designed to help query data. In the same way, data processing views are designed to be computed from the data and to help answer queries about information "hidden" in pure data.
- Queries Queries are questions we ask of our data. With queries we want to dig out or discover information "hidden" in pure data. We can do this directly on pure data or with the help of views to make the process easier and more understandable.
Our data can fulfil many different conditions and can be in accordance with different sets of rules. I'm an advocate of simplicity and I'm against complicating things if it does not bring profits. As can be read in Time series databases: Everything changes. New trend of XXI century, "...most systems work best if they are kept simple rather than made complicated; therefore simplicity should be a key goal in design and unnecessary complexity should be avoided." Because of this I want to focus now on the most wanted and versatile set of properties. Whatever we think, one thing is true: we don't want to lose our data; we want it to be true now and in the future; and we want to be able to go back through all the changes we made to it. More technically, we want our data to be raw, immutable, and eternally true.
In everything I do in my professional life I try to be faithful to my own rule: never, ever destroy original data. If I prepare an image for one of my tutorials, I always keep the original screenshot, even if I know I need only a small part of it. When I need to prepare a set of artificial data to be used during my classes, I save the source code of the program used to generate it. In most cases the operations I perform on my data are not reversible. For example, when we paint a line on a bitmap and save it, we can't revert this action after reloading the image. That is why I always keep the original data.
Storing data in the rawest form possible is hugely valuable because it maximizes our ability to obtain new insights, whereas any processing, like aggregating, overwriting, or deleting, limits what our data can tell us. Consider the following example
```
              aggregation        aggregation
RAW RECORDS -------------> X -------------> Y
```
Having raw data we can infer X. Having processed X we can make another aggregation, but there is no chance to go back: raw data cannot be restored from Y or even from X. Maybe it's not a strict rule, but as we can see, the rawer our data, the more answers we can get.
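A minimal Python sketch of this one-way street, with made-up purchase records standing in for the raw data:

```python
# RAW RECORDS: individual purchases (user, amount); values are made up.
raw_records = [("alice", 10), ("alice", 5), ("bob", 7)]

# X: aggregate the raw records into per-user totals.
x = {}
for user, amount in raw_records:
    x[user] = x.get(user, 0) + amount
# x == {'alice': 15, 'bob': 7}

# Y: aggregate X further into a single grand total.
y = sum(x.values())
# y == 22

# Neither x nor y lets us reconstruct raw_records: we can no longer
# tell that alice made two separate purchases of 10 and 5.
```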
Keep in mind that
- Unstructured data is rawer than normalized data. Recall vector normalization from physics or mathematics classes. We can normalize vectors, and relative magnitudes and directions will be preserved, but this way we lose some information (the original absolute magnitudes) that we can't restore (see the sketch after this list).
```
Before normalization:
v = [1, 0], w = [2, 0]

Normalize length to not exceed 1:
|v| = 1, |w| = 2       calculate lengths
l = max(|v|, |w|)      calculate max length
v_n = v / l
w_n = w / l            normalize
v_n = [0.5, 0]
w_n = [1, 0]
```
Having only v_n and w_n there is no method to get back v and w.
- More information doesn't necessarily mean rawer data. Sometimes additional information serves only as the container for the contents and shouldn't be part of our raw data. For example, when we collect text scraped from web pages, the surrounding HTML markup is just a container; the extracted text, not the markup, is what belongs in our raw data.
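The same normalization, sketched in Python (the length helper is a local function written for this illustration, not a library call):

```python
# Euclidean length of a vector, defined locally for this sketch.
def length(u):
    return sum(c * c for c in u) ** 0.5

v, w = [1.0, 0.0], [2.0, 0.0]

l = max(length(v), length(w))    # l == 2.0
v_n = [c / l for c in v]         # [0.5, 0.0]
w_n = [c / l for c in w]         # [1.0, 0.0]

# Relative magnitudes and directions survive, but the scale factor l
# is gone: (v, w) and (2*v, 2*w) normalize to exactly the same pair,
# so the original absolute magnitudes cannot be recovered.
```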
Data immutability means that we don't update or delete data -- we only add more. By using an immutable schema for Big Data systems, we make the system fault tolerant. Making our system, and our data in particular, resistant to any faults is an essential property. Especially faults generated by humans can be destructive. People make mistakes, and we must limit the impact of such mistakes and have mechanisms for recovering from them. With a mutable data model, a mistake can cause data to be lost, because values are actually overridden in the database. With an immutable data model, no data can be lost. If bad data is written, earlier (good) data units still exist. Fixing the data system is just a matter of deleting the bad data units and recomputing the views built from the raw dataset.
One of the trade-offs of the immutable approach is that it uses more storage than a mutable schema. Rather than storing a current snapshot of the world, as done by the mutable schema, we create a separate record every time a piece of data or information evolves. We track each data field so that the entire history of data changes is stored, rather than just the current view of the world. In such a case, when multiple instances of a data field exist, we have to tie each of them to the moment in time when the information is known to be true.
1. A mutable relational table: an update overwrites existing data, so we lose the old value

```
f1  f2  f3              f1  f2  f3
==========   update     ==========
a1  b1  c1   ------->   a1  b1  c1
a2  b2  c2   c2 -> c22  a2  b2  c22   <-- c2 is lost
a3  b3  c3              a3  b3  c3
```

2. An immutable relational table: an update adds a new record with a new timestamp and an active marker which informs which record is active (up-to-date)

```
f1  f2  f3   t   active
=======================
a1  b1  c1   t1  yes
a2  b2  c2   t2  no
a3  b3  c3   t3  yes
a2  b2  c22  t4  yes
```
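Here is a minimal Python sketch of the second, immutable option; the integer counter stands in for a real clock, and the table layout follows the diagrams above:

```python
import itertools

clock = itertools.count(1)      # stand-in for real timestamps t1, t2, ...
table = []                      # rows: (f1, f2, f3, t)

def insert(f1, f2, f3):
    # Every change is a new timestamped row; nothing is overwritten.
    table.append((f1, f2, f3, next(clock)))

insert("a1", "b1", "c1")
insert("a2", "b2", "c2")
insert("a3", "b3", "c3")
insert("a2", "b2", "c22")       # the "update" c2 -> c22

def active(f1, f2):
    # The active (up-to-date) record for a key is simply the newest one.
    rows = [r for r in table if r[:2] == (f1, f2)]
    return max(rows, key=lambda r: r[3])

print(active("a2", "b2"))       # ('a2', 'b2', 'c22', 4)
# The overwritten value c2 is still stored, so nothing is ever lost.
```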
An immutable raw dataset is sometimes called a master dataset. Each piece of data coming from this set is true in perpetuity thanks to tagging it with a timestamp.
There are many ways we could choose to represent a master dataset. We have a choice between traditional relational tables, structured XML, semistructured JSON documents, or any other possibility we know. It's up to our needs and knowledge. In this part we want to describe another possibility, known as the fact-based model.
Recalling the example from Most wanted properties of data: immutable, we can express it in the following form
1. In a relational table with columns, the table changes during updates, and in the updated table we lose some data (as shown above).
2. In the fact-based model, each field is tracked in a separate table, and each row has a timestamp for when it's known to be true (as shown in the diagrams below).
The most important properties of the fact-based model are
- timestamping: each fact is tagged with the moment in time when it became true
- atomicity: each fact captures a single piece of information about an entity, so it can be added or removed as a whole
- identifiability: each fact can be uniquely identified, which makes it possible, for example, to detect duplicates
```
1. Initial state

   f1 f2 f3
   ==========
   a1 b1 c1
   a2 b2 c2
   a3 b3 c3

2. Data update c2 -> c22

      (bad)      OR      (better)
   f1 f2 f3           f1 f2 f3   t
   ==========         ==============
   a1 b1 c1           a1 b1 c1   t1
   a2 b2 c22          a2 b2 c2   t2
   a3 b3 c3           a3 b3 c3   t3
                      a2 b2 c22  t4

3. Problem with nonexisting data

   f1 f2 f3   t
   ==============
   a1 b1 c1   t1
   a2 b2 c2   t2
   a3 b3 c3   t3
   a2 b2 c22  t4
   a4 -- c4   t5   <-- space needed to store nonexisting (NULL) data
```
```
Collect data in the fact-based model

1. Initial state

   id f1 t      id f2 t      id f3 t
   ==========   ==========   ==========
   1  a1 t1     1  b1 t1     1  c1 t1
   2  a2 t2     2  b2 t2     2  c2 t2
   3  a3 t3     3  b3 t3     3  c3 t3

2. Data update c2 -> c22

   no change    no change
   id f1 t      id f2 t      id f3 t
   ==========   ==========   ==========
   1  a1 t1     1  b1 t1     1  c1 t1
   2  a2 t2     2  b2 t2     2  c2 t2
   3  a3 t3     3  b3 t3     3  c3 t3
                             2  c22 t4

   Comment: When a new value of existing data arrives, an insert is
   made to only one table. We have a new fact -- the value of
   feature 3 of item 2 has changed from c2 to c22 -- so we change
   only this table.

3. Problem with nonexisting data

                no change
   id f1 t      id f2 t      id f3 t
   ==========   ==========   ==========
   1  a1 t1     1  b1 t1     1  c1 t1
   2  a2 t2     2  b2 t2     2  c2 t2
   3  a3 t3     3  b3 t3     3  c3 t3
   4  a4 t5                  2  c22 t4
                             4  c4 t5

   Comment: No need to store nonexisting data.
```
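A minimal Python sketch of the fact-based tables above, with the timestamps t1..t5 replaced by the integers 1..5:

```python
from collections import defaultdict

tables = defaultdict(list)     # field name -> list of facts (id, value, t)

def add_fact(field, entity_id, value, t):
    # Each fact is atomic: one field of one entity at one moment in time.
    tables[field].append((entity_id, value, t))

# 1. Initial state: items 1..3 arrive at times 1..3.
for i, (a, b, c) in enumerate(
        [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")],
        start=1):
    add_fact("f1", i, a, i)
    add_fact("f2", i, b, i)
    add_fact("f3", i, c, i)

# 2. Update c2 -> c22: a single new fact in a single table.
add_fact("f3", 2, "c22", 4)

# 3. Partial information about a new entity 4 (no f2 value): we simply
#    add no row to the f2 table -- no NULL placeholders are needed.
add_fact("f1", 4, "a4", 5)
add_fact("f3", 4, "c4", 5)

print(tables["f3"])
# [(1, 'c1', 1), (2, 'c2', 2), (3, 'c3', 3), (2, 'c22', 4), (4, 'c4', 5)]
```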
Advantages
- Is queryable at any time in its history (see the sketch after this list)
- Tolerates human errors
- Handles partial information. Storing one fact per record makes it easy to handle partial information about an entity without introducing NULL values into your dataset.
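To illustrate the first advantage, here is a small sketch that queries the f3 table from the example above "as of" a chosen moment in time; the integer timestamps stand in for t1..t5.

```python
# The f3 table from the fact-based example: facts (id, value, t).
f3 = [(1, "c1", 1), (2, "c2", 2), (3, "c3", 3), (2, "c22", 4)]

def value_as_of(facts, entity_id, t):
    """Return the newest value for entity_id that was known at time t."""
    known = [(ts, v) for (i, v, ts) in facts if i == entity_id and ts <= t]
    return max(known)[1] if known else None

print(value_as_of(f3, 2, 3))   # 'c2'  -- the state before the update
print(value_as_of(f3, 2, 5))   # 'c22' -- the current state
```

Because no fact is ever overwritten, every past state of the dataset remains answerable with the same simple query.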