
Big Data concepts and terminology


Information presented on this page may be incomplete or outdated. You can find the most up-to-date version in my book Engineering of Big Data Processing.

In this part we cover the following topics: the definition of Big Data, basic terminology (properties, data and information, the DIKW pyramid, data analysis and analytics), Big Data characteristics (the five Vs) and the different types of data.


Definition

In many different sources the term Big Data is defined as data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them. [...] Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.[W1] This kind of definition focuses on an important concept, but not the only one. One of the best definitions, because it is short, descriptive and covers much wider ideas, I have found in [B3]:

Big Data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from disparate sources.

So Big Data is not only about data itself but, what seems to be more important, about everything we need to make use of large collections of data. Specifically, Big Data addresses distinct requirements, such as combining multiple unrelated datasets, processing large amounts of unstructured data and harvesting hidden information in a time-sensitive manner.

For the last few years we have been able to observe something we may call Big Data Madness. Everyone announces that they use Big Data (together with Cloud Computing and Machine Learning), either because they think they need it or because they must say so to stay in the mainstream of contemporary state-of-the-art technologies, according to the rule: whether I need this or not, if my business competitors have it, then I must have it too. Because of this, Big Data may appear to be a new discipline, but it has been developing for years. The management and analysis of large datasets has been a long-standing problem; recall one of the best known examples: the US census[W2].

In 1880, as new arrivals flooded into the United States and the population exploded, the US census turned into an administrative nightmare. The work of measuring and recording the fast-growing country’s population was maddeningly slow and expensive. Clerks would need eight years to finish compiling the census.
Next time, in 1890, the Census Bureau decided to use Hollerith's punched card tabulating machine to work on the census. It did the job in just two years and saved the government US$5 million. By the way, we got an extra new value. Not only could the machines count faster, but they could also understand information in new ways. Rearranging the wires on a tabulating machine allowed users to learn things they never knew they could learn, and at speeds no one thought possible. And this is something really precious which makes a real difference. Traditional analytic approaches, based on statistics, approximate measures of something by sampling it. Big Data adds to this the possibility of processing entire datasets, making such sampling unnecessary and the analysis more accurate.

Talking about definitions, we have to stress that the one selected above is the best for me but not the only one. The whole spectrum of ideas gathered under the Big Data name blends mathematics, statistics, computer science and subject matter expertise. This mixture has led to some confusion as to what comprises the field of Big Data, and the response one receives will depend upon the perspective of whoever is answering the question. What is more, the boundaries of what constitutes a Big Data problem are also changing due to the ever-shifting and advancing landscape of software and hardware technology. This is due to the fact that the definition of Big Data takes into account the impact of the data's characteristics on the design of the solution environment itself. A long time ago, in the 90s, I had a PC with 4 MiB of RAM and I could play the most demanding games, while one gigabyte of data was a Big Data problem requiring special-purpose computing resources. Now, gigabytes of data are commonplace and can be easily transmitted, processed and stored even on a mobile phone.

The applications and potential benefits of Big Data are broad, including, but not limited to

  • optimization,
  • identification,
  • predictions,
  • fault and fraud detection,
  • improved decision-making,
  • scientific discoveries.

However, please don't fall into Big Data Madness and remember to consider the numerous issues that come with adopting Big Data analytics approaches. These issues need to be understood and well weighed against the anticipated benefits.


Terminology


Property types

We have two main types of properties:

  • qualitative (pl. jakościowy),
  • quantitative (pl. ilościowy).

Qualitative properties are properties which can be observed but cannot be computed nor measured with a numerical result. We use this type of properties to describe a given topic in a more abstract way, including even impressions, opinions, views and motivations. This brings depth of understanding to a topic but also makes it harder to analyze. We consider this type of properties as being unstructured. When the data type is qualitative, the analysis is non-statistical. Qualitative properties ask (or answer) Why?

Quantitative properties focus on numbers and mathematical calculations and can be calculated and computed. We consider this type of properties as being structured and statistical. Quantitative properties ask (or answer) How much? or How many?

The same, but in comparison form

Factor       | Qualitative                                                    | Quantitative
Meaning      | The data in which the classification of objects is based on attributes and properties. | The data which can be measured and expressed numerically.
Analysis     | Non-statistical                                                | Statistical
Type of data | Unstructured                                                   | Structured
Question     | Why?                                                           | How many? or How much?
Used to      | Get an initial understanding.                                  | Recommend a final course of action.
Methodology  | Exploratory                                                    | Conclusive
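
To make the distinction more concrete, below is a minimal sketch in Python; the survey records and field names are invented for illustration only and do not come from any real dataset.

    # A few made-up survey responses; field names and values are illustrative only.
    responses = [
        {"opinion": "I love the product, it feels premium", "rating": 9, "items_bought": 3},
        {"opinion": "Too expensive for what it offers",      "rating": 4, "items_bought": 1},
        {"opinion": "Decent, but the setup was confusing",   "rating": 6, "items_bought": 2},
    ]

    # Quantitative properties (rating, items_bought) can be computed on directly.
    avg_rating = sum(r["rating"] for r in responses) / len(responses)
    total_items = sum(r["items_bought"] for r in responses)
    print(f"Average rating: {avg_rating:.1f}, items bought in total: {total_items}")

    # The qualitative property (opinion) answers "Why?" -- it cannot be averaged,
    # only read, categorized or otherwise interpreted.
    for r in responses:
        print("Why this rating?", r["opinion"])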


Discrete and continuous property

Our properties can be understood as quantitative information about a given topic. If a property's characteristic is qualitative, it should be transformed into a quantitative one by providing numerical data for that characteristic (we have to map qualitative information into numbers) for the purpose of statistical analysis.
The quantitative characteristic can be considered as being

  • discrete, or
  • continuous.

Discrete is a type of statistical information that can assume only a fixed number of distinct values and sometimes lacks an inherent order. It is also known as categorical, because it has separate, indivisible categories.

Continuous is a type of statistical information that can assume all possible values within a given range. If a property can take an infinite and uncountable set of values, then the property is referred to as continuous.

The same, but in comparison form

Factor              | Discrete                                                        | Continuous
Meaning             | Refers to the variable that assumes a finite number of isolated values. | Refers to the variable which assumes an infinite and uncountable number of different values.
Represented by      | Isolated points.                                                | Connected points.
Values (provenance) | Values are obtained by counting.                                | Values are obtained by measuring.
Values (assumed)    | Distinct or separate values.                                    | Any value between two values.
Classification      | Non-overlapping.                                                | Overlapping.
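
A minimal sketch of the mapping mentioned above, with hypothetical observations: the qualitative property (colour) is encoded as discrete numeric codes, while a measured property such as temperature stays continuous. The property names and values are invented for illustration.

    # Hypothetical observations: "colour" is qualitative, "temperature" is quantitative.
    observations = [
        {"colour": "red",   "temperature": 21.4},
        {"colour": "green", "temperature": 19.8},
        {"colour": "red",   "temperature": 22.1},
        {"colour": "blue",  "temperature": 20.3},
    ]

    # Map the qualitative property into numbers (label encoding) so it can be used
    # in statistical analysis. The resulting variable is discrete: it can take only
    # a fixed number of isolated values and has no inherent order.
    categories = sorted({o["colour"] for o in observations})
    code_of = {name: code for code, name in enumerate(categories)}
    colour_codes = [code_of[o["colour"]] for o in observations]   # e.g. [2, 1, 2, 0]

    # Temperature is continuous: it is obtained by measuring and may take any value
    # within a range, so it is kept as a floating-point number.
    temperatures = [o["temperature"] for o in observations]

    print("Encoding:", code_of)
    print("Discrete colour codes:", colour_codes)
    print("Continuous temperatures:", temperatures)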


Data and datasets

Data is a set of values of qualitative or quantitative variables (properties). Pieces of data are individual pieces of information.

The Latin word data is the plural of datum (en. (thing) given, pl. dawać/ofiarować), the neuter past participle of dare (en. to give, pl. daję). In consequence, datum should be used in the singular and data for the plural, though in non-specialist, everyday writing data is most commonly used in the singular, as a mass noun (like "information", "sand" or "rain"). Truth be told, I observe this tendency to be more and more popular. The first English use of the word data is from the 1640s. Using the word data to mean transmittable and storable computer information was first done in 1946. The expression data processing was first used in 1954.[W3]

A dataset is a collection or group of related data. Each group or dataset member (datum) shares the same set of attributes or properties as the others in the same dataset.


Difference between data and information

Data is a dumb set of values. Nothing more. When the data is processed and transformed in such a way that it becomes useful to the users, it is known as information.

Raw facts we collect about events, ideas, entities or anything else are called data. In the most basic case, data are simple text and numbers. We turn raw facts into information when we process and interpret them. So when data starts to "speak", when something valueless is turned into something priceless, we have information. The same thing may be considered data or information depending on the context in which it is used.

The same, but in comparison form

Factor             | Data                                                                      | Information
Meaning            | Raw facts gathered about someone or something, which are bare and random. | Facts, concerning a particular event or subject, which are refined by processing.
Form (what is it?) | Just text and numbers.                                                    | Refined data.
Physical form      | Unorganized                                                               | Organized
Is it useful?      | Who knows? May or may not be.                                             | Always
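
As a toy illustration of this difference, the sketch below turns a "dumb" list of values into information by processing and interpreting it; the sensor readings are invented.

    # Data: raw, context-free numbers collected from a sensor (invented values).
    readings = [21.0, 21.4, 21.9, 22.6, 23.4, 24.3]

    # Information: the same values processed and interpreted so that they become
    # useful -- here, the average level and the trend over the observation window.
    average = sum(readings) / len(readings)
    trend = readings[-1] - readings[0]

    print(f"Average temperature: {average:.1f} °C")
    print(f"The room warmed up by {trend:.1f} °C during the measurement period")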


DIKW pyramid

Going further with data "transformations" we reach the DIKW (data, information, knowledge, wisdom) pyramid. The DIKW pyramid shows that data, produced by events, can be enriched with context to create information, information can be supplied with meaning to create knowledge, and knowledge can be integrated to form wisdom, which is at the top.

There is a nice saying (by Miles Kington):

Knowledge is knowing a tomato is a fruit.
Wisdom is not putting it in a fruit salad

And this is the true essence of the problem we are discussing.

This way we increase the value we have from hindsight through insight to foresight.

  • Hindsight: understanding of a situation or event only after it has happened or developed.
  • Insight: the capacity to gain an accurate and deep intuitive understanding of a person or thing.
  • Foresight: the ability to predict or the action of predicting what will happen or be needed in the future.

Example:
H: Understand why the plane crashed.
I: Understand what is going on with the plane at the moment and whether it will result in a crash soon.
F: Predict that the plane may crash in the future if the engines are not maintained.


Data analysis

Data analysis is the process of examining data to find facts, relationships, patterns, insights and/or trends. The overall goal of data analysis is to

  • discover useful information,
  • make conclusions,
  • support better decision-making.


Data analytics

Data analytics is a term that encompasses data analysis. Data analytics is a discipline that includes the management of the complete data lifecycle, which encompasses collecting, cleansing, inspecting, organizing, storing, transforming, analyzing and modeling data. The term includes the development of analysis methods, scientific techniques and automated tools. Data analytics enables data-driven decision making with scientific backing, so that decisions can be based on factual data/information and not simply on past experience or intuition alone.


Data analytics types

There are four general categories of analytics that are distinguished by the results they produce

  • descriptive (pl. opisowy),
  • diagnostic (pl. diagnozujący),
  • predictive (pl. przewidujący),
  • prescriptive (pl. nakazowy).

The order of the categories matters: value and complexity increase from descriptive to prescriptive analytics. Descriptive analytics can be thought of as working with pure data, while prescriptive analytics as optimization. The different analytics types leverage different techniques and analysis algorithms. This implies that there may be varying data, storage and processing requirements to facilitate the delivery of multiple types of analytic results.


Descriptive analytics

Descriptive analytics is carried out to answer questions about events that have already occurred: What has happened? This type of analytics uses data aggregation and data mining to provide insight into the past.

Descriptive analysis, or statistics, does exactly what the name implies: it describes, or summarizes, raw data and makes it something that is interpretable by humans. It is analytics that describes the past. The past refers to any point in time at which an event has occurred, whether it is one minute ago or one year ago. Descriptive analytics is useful because it allows us to learn from past behaviors and understand how they might influence future outcomes. For example, descriptive analytics examines historical electricity usage data to help plan power needs and allow electric companies to set optimal prices.

Most analytics results are descriptive in nature. Descriptive analytics provides the least value and requires a relatively basic skillset. Descriptive analytics is often carried out via ad-hoc reporting or dashboards.
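
A minimal sketch of descriptive analytics in the spirit of the electricity example above: the historical usage figures are invented, and the "analysis" is nothing more than aggregating and summarizing what has already happened.

    from collections import defaultdict

    # Invented historical electricity usage in kWh: (month, usage).
    usage = [
        ("2018-11", 410), ("2018-12", 495), ("2019-01", 520),
        ("2019-02", 480), ("2018-11", 390), ("2018-12", 505),
    ]

    # Descriptive analytics: aggregate the past to answer "What has happened?".
    per_month = defaultdict(float)
    for month, kwh in usage:
        per_month[month] += kwh

    for month in sorted(per_month):
        print(f"{month}: {per_month[month]:.0f} kWh")
    print("Peak month:", max(per_month, key=per_month.get))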


Diagnostic analytics

Diagnostic analytics also concerns the past but answers the question: Why did it happen?

Diagnostic analytics aims to determine the cause of something that occurred in the past, using questions that focus on the reasons behind the event. The goal of this type of analytics is to determine what other information is related to the event in order to be able to answer the question of why it occurred.

Diagnostic analytics provides more value than descriptive analytics but requires a more advanced skillset. It usually requires collecting data from multiple sources and storing it in a structure that allows various analyses to be performed easily. Diagnostic analytics results are often viewed via visualization tools that enable users to identify trends and patterns.
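
A very small sketch of the idea, with invented records from two separately collected sources: outage logs are combined with weather observations so that we can start answering why the outages happened.

    # Two hypothetical, separately collected data sources (invented records).
    outages = [
        {"day": "2019-01-10", "outage_minutes": 85},
        {"day": "2019-01-17", "outage_minutes": 10},
        {"day": "2019-01-23", "outage_minutes": 120},
    ]
    weather = {
        "2019-01-10": "storm",
        "2019-01-17": "clear",
        "2019-01-23": "storm",
    }

    # Diagnostic analytics: relate the fact (outage) to other information
    # to answer "Why did it happen?".
    minutes_by_weather = {}
    for event in outages:
        condition = weather.get(event["day"], "unknown")
        minutes_by_weather[condition] = minutes_by_weather.get(condition, 0) + event["outage_minutes"]

    print(minutes_by_weather)   # {'storm': 205, 'clear': 10} -> storms look like the likely cause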


Predictive analytics


Predictive analytics, like descriptive and diagnostic analytics, also concerns the past, but it looks into the future and answers the question: What will happen? It uses statistical models and various forecasting techniques to analyze historical and current facts in order to understand the possible future and predict it with the highest possible probability.

With this type of analytics we generate future predictions based upon past events. To do this, some models of the past are created. It is very important to understand that these models are very tightly connected with the conditions under which the past events occurred. If these underlying conditions change, our prediction will fail and the models that make predictions need to be updated. The problem is that we cannot change our models ahead of time -- we can do this only after something happens. Even if we know that conditions are changing, we have to wait and reconcile ourselves with the thought that our predictions will be incorrect.

This kind of analytics involves the use of large datasets and various data analysis techniques. It provides greater value than both descriptive and diagnostic analytics, for the price of a more advanced skillset. The tools used vary and it is very common to use several of them. Various tools and languages, broad theoretical knowledge, unconventional approaches and an open mind may be needed to do something in the field of predictive analytics.
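
A minimal sketch of the idea, assuming invented monthly sales figures: a simple linear model of the past is fitted by hand and then used to extrapolate one step into the future. It remains valid only as long as the underlying conditions do not change.

    # Invented historical observations: month index and sales.
    months = [1, 2, 3, 4, 5, 6]
    sales  = [100, 108, 119, 127, 140, 148]

    # Fit a simple linear model of the past (ordinary least squares by hand).
    n = len(months)
    mean_x = sum(months) / n
    mean_y = sum(sales) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
             / sum((x - mean_x) ** 2 for x in months))
    intercept = mean_y - slope * mean_x

    # Predictive analytics: use the model of the past to answer "What will happen?".
    next_month = 7
    forecast = intercept + slope * next_month
    print(f"Forecast for month {next_month}: about {forecast:.0f} units")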


Prescriptive analytics


There will be nothing strange if we say that this type of analytics also looks at the past to think about the future, but in this case it answers the question: What should we do? It uses optimization and simulation algorithms to tell us what we should do to get the assumed result.

Prescriptive analytics attempts to prescribe a number of different possible action "sequences" that should be taken to reach a goal. What is important, these analytics attempt to evaluate the effect of future decisions and actions before they are actually made. In this sense, prescriptive analytics is much more than predictive analytics, because it not only predicts what will happen, but also why it will happen. So the focus is not only on which prescribed option is best to follow, but why.

This is the most demanding type of analytics, requiring much more than predictive analytics. To be successful in this area we very often have to be an expert in the field of interest. For example, to answer the question When is the best time to trade a particular stock? we have to deeply understand the stock market and all its nuances; the best option is that we are a trader ourselves. This type of job cannot be outsourced to a "general purpose" company or staff.
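
A toy sketch of the prescriptive idea, with an invented demand model: every candidate action (here, a ticket price) is evaluated through the model before any decision is actually made, and the recommended action is the one with the best simulated outcome.

    # Hypothetical demand model learned from past data: the higher the price, the fewer buyers.
    def expected_demand(price):
        return max(0.0, 1000 - 7 * price)

    # Prescriptive analytics: simulate the effect of each possible decision
    # *before* it is made, then recommend the action and explain why.
    candidate_prices = range(20, 121, 5)
    simulated = {price: price * expected_demand(price) for price in candidate_prices}

    best_price = max(simulated, key=simulated.get)
    print(f"Recommended price: {best_price} "
          f"(expected revenue {simulated[best_price]:.0f}, "
          f"expected demand {expected_demand(best_price):.0f} tickets)")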





Big Data characteristics

For a dataset to be considered Big Data, it must possess more than one of the characteristics commonly referred to as the Five Vs:

  • volume,
  • velocity,
  • variety,
  • veracity,
  • value.


These five Big Data characteristics are used to help differentiate data categorized as Big from other forms of data. Most of them were initially identified by Doug Laney in early 2001, when he published an article describing the impact of the volume, velocity and variety of e-commerce data on enterprise data warehouses. To this list, veracity has been added to emphasize the importance of data quality, since real data may come with a low signal-to-noise ratio. The goal of Big Data usage is to conduct analysis of the data in such a manner that high-quality results are delivered in a timely manner, providing optimal value to the enterprise.


Volume

It is assumed that today organizations and users worldwide create over 2.5 EB (1 EB = 2^60 bytes, more than 10^18; about 5 billion DVDs) of data a day. As a point of comparison, the Library of Congress currently holds more than 300 TB (1 TB = 2^40 bytes, more than 10^12; about 65,000 DVDs) of data. Almost any aspect of our lives is, or can be, a source of data (another question is whether we really need it) and it can be generated by humans, machines, as well as the environment. The most common sources are (H is used to denote humans, M -- machines, E -- the environment):

  • (H) social media, such as Facebook and Twitter,
  • (H) scientific and research experiments, such as physical simulation,
  • (H) online transactions, such as point-of-sale and banking,
  • (M) sensors, such as GPS sensors, RFIDs, smart meters and telematics,
  • (M) any working device we monitor for our safety, such as planes or cars,
  • (E) weather conditions, cosmic radiation.

Regardless of the generator (human, machines or environment) it is ultimately the responsibility of machines to generate the analytic results.


How much data?

Every minute

  • 4,166,667 likes are made by Facebook users,
  • 347,222 tweets are posted on Twitter,
  • 100,040 calls are made on Skype,
  • 77,166 hours of movies are streamed from Netflix,
  • 694 Uber users take a ride,
  • 51,000 applications are downloaded from the App Store.

Jarosław Sobel, Przetwarzanie Big Data, IT professional, 9/2018, p. 10

Data Never Sleeps 5.0, accessed 2019-02-16
How Much Data Is Generated Every Minute?, accessed 2019-02-16
How Much Data is Created on the Internet Each Day?, accessed 2019-02-16
How much data do we create every day?, accessed 2019-02-16
How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read, accessed 2019-02-16

[A5, W26, W27, W28, W29, W30]

Today, we collect all the data we can get, regardless of whether we really need it or not. The term data lake was created for this type of data "collection". It acts as a lake with lots of different things inside, not always easily accessible and visible. We do this (collect data) not necessarily because we need it but because we can. Maybe one day we will need it, we think. Over time, large amounts of data can hide what we previously stored. This is like my basement: I know that some elements I need are somewhere in it. Unfortunately, because there are so many elements, it's cheaper, easier and faster to buy a new one than to try to find it in my basement element lake.


Velocity

Velocity plays its role (and justifies the Big Data name) when lots of data arrives at high speed, which may result in the accumulation of enormous datasets within a very short period of time. From an enterprise's point of view, the velocity of data translates into the amount of time it takes for the data to be processed once it comes into the enterprise's possession.


Variety

Data variety refers to the multiple formats and types of data that need to be supported by Big Data solutions. It also concerns data format variability -- the data coming from the same source may arrive in different formats. Data variety brings challenges for enterprises in terms of data integration, transformation, processing, and storage.


Veracity

Veracity refers to the quality or fidelity of data. If we want to do something valuable with data, it needs to be assessed for quality. Because any data can be part of the signal or of the noise, this leads to data processing activities that resolve invalid data and remove noise. Noise is data that cannot be converted into information and thus has no value, whereas signals have value and lead to meaningful information. Data with a high signal-to-noise ratio has more veracity than data with a lower ratio. The signal-to-noise ratio of data depends on the source of the data and its type. For example, data that is acquired via online customer registrations usually contains less noise than data acquired via blog postings. In the first case we have to provide true data, while in the second we may provide whatever just meets some blog's requirements.
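
A minimal sketch of such an assessment, with invented registration records: rows that fail basic validity checks are treated as noise and removed before analysis.

    import re

    # Invented records of mixed quality, e.g. collected from a web form.
    records = [
        {"email": "anna@example.com", "age": 34},
        {"email": "not-an-email",     "age": 27},
        {"email": "jan@example.org",  "age": -5},
        {"email": "ola@example.net",  "age": 41},
    ]

    def is_signal(record):
        """Very rough veracity check: keep only records that can carry meaning."""
        valid_email = re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", record["email"]) is not None
        plausible_age = 0 < record["age"] < 130
        return valid_email and plausible_age

    signal = [r for r in records if is_signal(r)]
    noise_ratio = 1 - len(signal) / len(records)
    print(f"Kept {len(signal)} of {len(records)} records, noise ratio {noise_ratio:.0%}")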


Value

Value is defined as the usefulness of data. In most cases it is considered in relation to the problem -- in many cases to the data processing time. A stock quote delayed by 20 minutes has no value for making a trade, while we can wait several months to get the results of a complex physical simulation. Simplifying the issue, you can say that value and time are inversely related. The longer it takes for data to be turned into meaningful information, the less value it has for a business. Outdated information, even of very good quality and high probability, is totally useless.


More "V"s?

In this chapter five "V"s were described. It's worth noting that initially (2001) there were only three "V"s: volume, velocity and variety, later extended by veracity and then by value. Along with changes in the areas and methods of using big data, the characteristics of what we consider to be big data (what big data should be) also change. According to some sources [W17], today big data features may be summed up in 10 different Vs: volume, velocity, variety, veracity, value, validity, volatility, variability, viscosity and vulnerability.


Different types of data

As demonstrated, data can come from a variety of sources and be represented in various formats or types. The primary formats of data are:

  • structured data,
  • unstructured data,
  • semi-structured data,
  • metadata.


Structured data

Structured data conforms to a data model or schema and is often stored in tabular form. It is used to describe relationships between different entities and is, in consequence, most often stored in a relational database.


Unstructured data

Data that does not conform to a data model or data schema is known as unstructured data. It is estimated that unstructured data makes up 80% of the data within any given enterprise. Unstructured data has a faster growth rate than structured data.


Semi-structured data

Semi-structured data has a defined level of structure and consistency, but is not relational in nature. Instead, semi-structured data is hierarchical or graph-based. This kind of data is commonly stored in files that contain text. For instance, XML and JSON files are common forms of semi-structured data. Due to the textual nature of this data and its conformance to some level of structure, it is more easily processed than unstructured data.
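
For instance, a small JSON document like the one below is hierarchical rather than tabular, yet its partial structure makes it straightforward to process; the field names and values are, of course, just an invented example.

    import json

    # A hypothetical semi-structured record: hierarchical and self-describing,
    # but not forced into a fixed relational schema (note the optional field).
    document = """
    {
      "user": {"id": 42, "name": "Anna"},
      "orders": [
        {"item": "book",   "price": 12.50},
        {"item": "e-book", "price": 7.99, "gift_wrap": false}
      ]
    }
    """

    data = json.loads(document)
    total = sum(order["price"] for order in data["orders"])
    print(f'{data["user"]["name"]} spent {total:.2f} in total')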


Metadata

Metadata is data which is not the data itself but provides information about a dataset's characteristics and structure. The tracking of metadata is crucial to Big Data processing, storage and analysis because it provides information about the data's origin as well as all processing steps. In many cases metadata helps to process data. For example, we may keep some metadata about an image's resolution and the number of colors used. Of course we may get this data from the graphic file itself, but at the price of a longer processing time.
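
A small sketch of the idea, with a hypothetical metadata record kept next to a large image file: the characteristics can be read from the metadata alone, without opening and decoding the file itself.

    # Hypothetical metadata stored alongside a large image file (invented values).
    image_metadata = {
        "file": "survey_area_0042.tiff",      # the data itself lives elsewhere
        "resolution": (12000, 9000),          # pixels
        "colour_depth": 16,                   # bits per channel
        "created": "2019-02-16T10:31:00Z",
        "processing_steps": ["captured", "georeferenced", "compressed"],
    }

    # Decisions can be made from the metadata alone, without decoding the file.
    width, height = image_metadata["resolution"]
    if width * height > 50_000_000:
        print(f'{image_metadata["file"]}: too large, schedule for the batch cluster')
    else:
        print(f'{image_metadata["file"]}: small enough for interactive processing')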