
Processing concepts for Big Data

Information presented on this page may be incomplete or outdated. You can find the most up-to-date version in my book Engineering of Big Data Processing.

In this part we cover the following topics:

  • Terminology
  • Batch processing
  • Batch processing with MapReduce
  • Real-time processing
  • Streaming data processing
  • SCV principle


Terminology


In this section we introduce some basic terminology related to data processing (in general, not only Big Data). We will not dive too deeply into details and will say only as much as we need to understand what we will be talking about now and in the future material.

  • Divide-and-conquer approach In computer science, divide and conquer is an algorithm design paradigm based on multi-branched recursion. A divide-and-conquer algorithm works by recursively breaking down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.

    The main benefits of this paradigm are:

    • Solving difficult problems
    • Algorithm efficiency
    • Parallelism
    • Memory access
    • Roundoff control

    Some standard divide-and-conquer algorithms include:

    • Binary search
    • Merge sort
    • Quicksort
    • Closest pair of points
    • Strassen's algorithm for efficient matrix multiplication
    • Karatsuba algorithm for fast multiplication
    • Cooley-Tukey Fast Fourier Transform (FFT) algorithm

    To be honest, the divide-and-conquer approach is something we have to use rather than something we want to use. It is much easier to implement an algorithm that processes data step by step than one that works on "chunks". The problem with the step-by-step approach is that we cannot speed it up indefinitely.
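
    To make the pattern concrete, here is a minimal divide-and-conquer sketch in Python, using merge sort from the list above (illustrative code, not part of the original text):

    ```python
    def merge_sort(data):
        # Base case: a list of 0 or 1 elements is already sorted.
        if len(data) <= 1:
            return data
        # Divide: break the problem into two sub-problems of half the size.
        mid = len(data) // 2
        left = merge_sort(data[:mid])
        right = merge_sort(data[mid:])
        # Combine: merge the two sorted halves into one sorted result.
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    print(merge_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]
    ```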

  • Parallel data processing Parallel data processing is the simultaneous execution of multiple sub-tasks that collectively comprise a larger task, with the goal of reducing overall processing time. Traditionally it is achieved within the confines of a single machine with multiple processors or cores.
  • Distributed data processing Distributed data processing is closely related to parallel data processing in that the same principle of "divide-and-conquer" is applied. However, distributed data processing is always achieved through physically separate machines that are networked together.
  • Cluster, grid and cloud

    A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Clusters have each node set to perform the same task, controlled and scheduled by software.


    Grids are a form of distributed computing whereby a "super virtual computer" is composed of many networked, loosely coupled computers acting together to perform large tasks. Grid computing combines computers from multiple administrative domains to reach a common goal, to solve a single task, and may then dissolve when no longer needed. Grid computers also tend to be more heterogeneous and geographically dispersed (thus not physically coupled) than cluster computers.

    Grid is a "system" that manages resources under the control of different computers and connected computer networks, uses open protocols and general purpose interfaces (discovery and access to resources, authorization, authentication) and provides services of appropriate quality (QoS, offers higher-level services). The goal of grid technology is to create a simple yet powerful virtual computer with a huge number of connected, heterogeneous systems sharing various resources. Grid resources can be administered by various organizations. Sharing resources follows the local resource management policy used in given organization. Resources in grid have at least some of the following features:

    • geographically dispersed,
    • heterogeneous hardware and software,
    • dynamic (availability variable over time),
    • potentially unreliable,
    • owned and managed by various organizations,
    • various security requirements and policies,
    • different resource management policies,
    • connected by a heterogeneous network (various layers, communication protocols, topologies).


    Cloud computing makes computer system resources, especially storage and computing power, available on demand without direct active management by the user. The term is generally used to describe data centers available to many users over the Internet. If the connection to the user is relatively close, the serving machine may be designated an edge server.

  • Task Parallelism Task parallelism refers to the parallelization of data processing by dividing a task into sub-tasks and running each sub-task on a separate node.
  • Data Parallelism Data parallelism refers to the parallelization of data processing by dividing a dataset into multiple sub-datasets and processing each sub-dataset in parallel.
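
To make the distinction concrete, here is a minimal sketch of data parallelism using Python's standard multiprocessing module (the dataset and the squaring function are assumptions made for the example; task parallelism would instead run different functions on separate workers):

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # The per-chunk work -- here simply squaring each value.
    return [x * x for x in chunk]

if __name__ == "__main__":
    dataset = list(range(100))
    # Data parallelism: divide the dataset into four sub-datasets...
    chunks = [dataset[i::4] for i in range(4)]
    # ...and process each sub-dataset in parallel on a pool of workers.
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # Combine the partial results into the final answer.
    results = [y for part in partial_results for y in part]
    print(len(results))  # 100
```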

When it comes to big data, there are two main ways to process information: the first -- and more traditional -- approach is batch processing; the second is real-time processing.


Batch processing

As the term implies, batch-based data processing involves collecting a series of data, storing it until a given quantity has been collected, and then processing all of that data as a group -- in other words, as a batch. It’s much different from processing each piece of data as it is collected. With older technologies, it was typically more efficient to process information in batches rather than working with each small piece of data individually: doing so reduces the number of discrete I/O events that need to take place, and it can also save network bandwidth by compressing data within batches.

Very often batch processing means processing a large volume of data, collected over a period of time, all at once. Jobs are typically completed on a well-known data set (one that is defined and known before processing starts) in non-stop, sequential order. Batch processing can be used offline (very often the term offline processing is used instead of batch processing) and gives complete control over when to start the processing. This way we can delay (postpone) processing until the computer is not executing many other tasks, which helps balance the overall load and system utilization. Obviously, because of the data size, processing will take a large amount of time.

Batch processing works well in situations where we don’t need real-time analytics results, and when it is more important to process large volumes of data to get detailed insights than it is to get fast results. This type of workload can be characterized as

  • involving large amounts of data,
  • using sequential reads/writes (from/to persistent storage) of many records at once,
  • comprising groups of read or write queries,
  • and responding with high latency.

Pros: It is relatively simple, easy to set up and low in cost compared to real-time mode.
Cons: In batch mode, data is processed offline in batches, and the response time can vary from minutes to hours. In most cases data must be persisted to persistent storage before it can be processed.

Hadoop is probably the best-known big data framework today that was designed first and foremost for batch processing.


Batch processing with MapReduce

No doubt, MapReduce is the best-known batch processing model, having its roots in the divide-and-conquer principle.

MapReduce does not require that the input data conform to any particular data model. Therefore, it can be used to process schema-less datasets. According to the divide-and-conquer principle, a whole dataset is divided into multiple smaller parts, and a set of operations is performed on each part independently and in parallel. The results from all operations are then aggregated (in a sense: summarized) to form the final result. Although in many cases MapReduce returns results much faster than traditional algorithms, it is not expected to have low latency.

The MapReduce processing engine works differently compared to the traditional data processing paradigm. Traditionally, data processing requires moving data from the storage node to the processing node that runs the data processing algorithm. This approach works fine for smaller datasets; with large datasets, moving data can result in high network bandwidth utilization and high latency. Consequently, for large amounts of data, moving it generates more overhead than the actual processing.

With MapReduce, the algorithm, instead of the data, is moved to the nodes that store the data. The data processing algorithm executes in parallel on these nodes, thereby eliminating the need to move the data first. The real problem with MapReduce is how to divide your data so that it ends up on the correct processing node. Column-family NoSQL databases help to accomplish this.

A single processing run of MapReduce is composed of a map task and a reduce task, and each task consists of multiple stages (a schematic sketch follows the list):

  • map: split, map, [combine], partition;
  • reduce: shuffle, sort, reduce.
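
Schematically, the contract between the two user-defined functions can be written down as follows (a sketch of the conventional (K, V) notation only; the names are illustrative, not the API of any particular engine):

```python
from typing import Iterable, Tuple, TypeVar

K1, V1, K2, V2, K3, V3 = (TypeVar(n) for n in ("K1", "V1", "K2", "V2", "K3", "V3"))

# map: one input record (K1, V1) -> zero or more intermediate pairs (K2, V2)
def mapper(key: K1, value: V1) -> Iterable[Tuple[K2, V2]]:
    ...

# reduce: one intermediate key and all of its values -> output pairs (K3, V3)
def reducer(key: K2, values: Iterable[V2]) -> Iterable[Tuple[K3, V3]]:
    ...
```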

  • Split stage

    The input is divided into splits. This will distribute the work among all the map nodes. Each split is parsed into its constituent records as a key-value pair (K1, V1). The key is usually the ordinal position of the record, and the value is the actual record.


    Map stage

    The map function executes user-defined logic. As we know from the previous stage, each split generally contains multiple key-value pairs, and the mapper is run once for each key-value pair in the split. The mapper processes each key-value pair according to the user-defined logic and further generates a key-value pair as its output (K2, V2). The output key can either be the same as the input key or any other serializable user-defined object. Similarly, the output value can either be the same as the input value or any other serializable user-defined object.
    When all records of the split have been processed, the output is a list of key-value pairs (list(K2, V2)) where multiple key-value pairs can exist for the same key. It should be noted that for an input key-value pair, a mapper may not produce any output key-value pair (filtering) or can generate multiple key-value pairs (demultiplexing).


    Combine stage

    Generally, the output of the map function is handled directly by the reduce function. However, map tasks and reduce tasks are mostly run over different nodes. This requires moving data between mappers and reducers. This data movement can consume a lot of bandwidth and directly contributes to processing latency. With larger datasets, the time taken to move the data between map and reduce stages can exceed the actual processing undertaken by the map and reduce tasks. For this reason, the MapReduce engine provides an optional combine function that summarizes a mapper’s output before it gets processed by the reducer.

    A combiner is essentially a reducer function that locally groups a mapper’s output on the same node as the mapper. A reducer function can be used as a combiner function, or a custom user-defined function can be used.
    The MapReduce engine combines all values for a given key from the mapper output, creating multiple key-value pairs as input to the combiner where the key is not repeated and the value exists as a list of all corresponding values for that key. The combiner stage is only an optimization stage, and may therefore not even be called by the MapReduce engine. For example, a combiner function will work for finding the largest or the smallest number, but will not work for finding the average of all numbers since it only works with a subset of the data.
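
    To see why the average breaks while the maximum does not, consider this tiny demonstration (the numbers are invented for the illustration):

    ```python
    # Two mappers produce partial value lists for the same key.
    part1, part2 = [1, 2], [3]

    # Maximum composes: combining per-node maxima gives the global maximum.
    assert max(max(part1), max(part2)) == max(part1 + part2)  # 3 == 3

    # Average does not compose: averaging per-node averages is wrong in general.
    avg = lambda values: sum(values) / len(values)
    print(avg([avg(part1), avg(part2)]))  # 2.25 -- average of averages
    print(avg(part1 + part2))             # 2.0  -- the correct global average
    ```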


    Partition stage

    The output from the previous stage is divided into partitions between reducer instances. Although each partition contains multiple key-value pairs, all records for a particular key are assigned to the same partition. The MapReduce engine guarantees a random and fair distribution between reducers while making sure that all of the same keys across multiple mappers end up with the same reducer instance.

    Depending on the nature of the job, certain reducers can sometimes receive a large number of key-value pairs compared to others. As a result of this uneven workload, some reducers will finish earlier than others. Overall, this is less efficient and leads to longer job execution times than if the work was evenly split across reducers. This can be rectified by customizing the partitioning logic in order to guarantee a fair distribution of key-value pairs.
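
    A default partitioner is often nothing more than a stable hash of the key modulo the number of reducers; customizing this function is where a fairer split can be enforced. A sketch (not the API of any particular engine):

    ```python
    from zlib import crc32

    def partition(key: str, num_reducers: int) -> int:
        # A stable hash guarantees that all records with the same key --
        # regardless of which mapper produced them -- reach the same reducer.
        return crc32(key.encode("utf-8")) % num_reducers

    for key in ("10-19", "30-39", "50-59", "60-69"):
        print(key, "->", "reducer", partition(key, 3))
    ```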


    Shuffle and sort stage

    Output from all partitioners is copied across the network to the nodes running the reduce task. This is known as shuffling. The list-based key-value output from each partitioner can contain the same key multiple times.

    Next, the MapReduce engine automatically groups and sorts the key-value pairs according to the keys so that the output contains a sorted list of all input keys and their values with the same keys appearing together. The way in which keys are grouped and sorted can be customized.

    This merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values.


    Reduce stage

    Reduce is the final stage of the reduce task. Depending on the user-defined logic specified in the reduce function (reducer), the reducer will either further summarize its input or will emit the output without making any changes. In either case, for each key-value pair that a reducer receives, the list of values stored in the value part of the pair is processed and another key-value pair is written out.


    A simple MapReduce example

    Suppose we have to perform a count on the person list, taking into consideration age ranges: 0-9 years, 10-19 years, 20-29 years, etc.

    • Input For simplicity, let our input consist of 10 persons: Anna (12), Bob (34), Celine (63), Diana (51), Edmund (55), Francesco (67), Giselle (15), Henry (57), Isabell (32), Jane (39)
    • Split We divide the input in three splits. This will distribute the work among all the map nodes.
      • split 1: Anna (12), Bob (34), Celine (63); set of pairs: {"anna": 12, "bob": 34, "celine": 63}
      • split 2: Diana (51), Edmund (55), Francesco (67); set of pairs: {"diana": 51, "edmund": 55, "francesco": 67}
      • split 3: Giselle (15), Henry (57), Isabell (32), Jane (39); set of pairs: {"giselle": 15, "henry": 57, "isabell": 32, "jane": 39}
    • Map The mapping process remains the same on all the nodes. In each of the mappers we assign a hardcoded value (1) to each person. The rationale behind the hardcoded value 1 is that every person, in itself, occurs exactly once. Next, a list of key-value pairs is created where the key is the age range and the value is one. So, for our set of splits we have the following lists of key-value pairs:
      • node 1: {"10-19": 1, "30-39": 1, "60-69": 1}
      • node 2: {"50-59": 1, "50-59": 1, "60-69": 1}
      • node 3: {"10-19": 1, "50-59": 1, "30-39": 1, "30-39": 1}
    • Combine The engine groups all values for a given key from the mapper output, creating key-value pairs in which the key is not repeated and the value is the list of all corresponding values for that key.
      • node 1: {"10-19": [1], "30-39": [1], "60-69": [1]}
      • node 2: {"50-59": [1, 1], "60-69": [1]}
      • node 3: {"10-19": [1], "50-59": [1], "30-39": [1, 1]}

      A combiner is essentially a reducer function (see next stage) that locally groups a mapper’s output on the same node as the mapper. A reducer function can be used as a combiner function, or a custom user-defined function can be used. As a result we get

      • node 1: {"10-19": 1, "30-39": 1, "60-69": 1}
      • node 2: {"50-59": 2, "60-69": 1}
      • node 3: {"10-19": 1, "50-59": 1, "30-39": 2}
    • Partition The output from the previous stage is divided into partitions between reducer instances.
      • node 1: {"10-19": 1 -> P1, "30-39": 1 -> P1, "60-69": 1 -> P3}
      • node 2: {"50-59": 2 -> P2, "60-69": 1 -> P3}
      • node 3: {"10-19": 1 -> P1, "50-59": 1 -> P2, "30-39": 2 -> P1}

      Although each partition contains multiple key-value pairs, all records for a particular key are assigned to the same partition.

    • Shuffle and sort
      Output from all partitioners is copied across the network to the nodes running the reduce task (shuffling).

      • partition on node 1: {"10-19": 1, "30-39": 1, "10-19": 1, "30-39": 2}
      • partition on node 2: {"50-59": 2, "50-59": 1}
      • partition on node 3: {"60-69": 1, "60-69": 1}

      Next, the MapReduce engine automatically groups

      • partition on node 1: {"30-39": 1, "30-39": 2, "10-19": 1, "10-19": 1}
      • partition on node 2: {"50-59": 2, "50-59": 1}
      • partition on node 3: {"60-69": 1, "60-69": 1}

      and sorts the key-value pairs according to the keys so that the output contains a sorted list of all input keys and their values with the same keys appearing together.

      • node 1: {"10-19": [1, 1], "30-39": [1, 2]}
      • node 2: {"50-59": [2, 1]}
      • node 3: {"60-69": [1, 1]}
    • Reduce
      The reducer will summarize its input. For each key-value pair that a reducer receives, the list of values stored in the value part of the pair is processed and another key-value pair is written out.

      • node 1: {"10-19": 2, "30-39": 3}
      • node 2: {"50-59": 3}
      • node 3: {"60-69": 2}
    • Output
      • "10-19": 2
      • "30-39": 3
      • "50-59": 3
      • "60-69": 2


    MapReduce vs classic divide-and-conquer approach

    "Dividing a problem to smaller ones until the individual problems can be solved independently and then combining them to answer the original question is known as the divide and conquer algorithm design technique.
    Recently, this approach to solve computational problems especially in the domain of very large data sets has been referred to as MapReduce rather than divide and conquer.
    My question is as follows: Is MapReduce anything more than a proprietary framework that relies on the divide and conquer approach, or are there details to it that make it unique in some respect?"

    There is a lot of discussion, starting with questions similar to the one above, about whether there is any (conceptual) novelty in MapReduce, or whether it is just a new implementation of old ideas, useful in certain scenarios -- see [W12, W13, W14] just to mention a few examples. A few years ago, MapReduce was hailed as a revolution in distributed programming. There have also been critics [W15], but by and large there was an enthusiastic hype, and it even got patented [W16]. So, is MapReduce something new compared to the well-known divide-and-conquer approach? The answers vary -- let's see some of them.

    • There is no novelty. [...] nothing new in computation, or even distributed computing was discovered by MapReduce.
    • It's not a super-sophisticated concept, but a very useful piece of infrastructure. MapReduce is a framework for implementing divide-and-conquer algorithms in an extremely scalable way, by automatically distributing units-of-work to nodes in an arbitrarily large cluster of computers and automatically handling failures of individual nodes by redistributing the unit-of-work to another node.
    • If you're asking about the MapReduce architecture, then it is very much just a divide and conquer technique. However, any useful MapReduce architecture will have mountains of other infrastructure in place to efficiently "divide", "conquer", and finally "reduce" the problem set. With a large MapReduce deployment (1000's of compute nodes) the steps to partition the work, compute something, and then finally collect all results are non-trivial. Things like load balancing, dead node detection, and saving interim state (for long-running problems) are hard problems by themselves.
    • The paradigm is named after Lisp's map() and reduce() functions. It was certainly a new idea to make a massively parallel version of map and reduce.
    • Because Google did it. There's nothing particularly wrong with it, but it isn't particularly novel, either. People have been doing these kinds of parallel processing for decades.
    • MapReduce is more about segregation and aggregation.
    • MapReduce is not simply a divide and conquer technique, though it looks that way in many examples. In the mapping step you can and frequently want to do a one-to-many relation. Thus you're not simply dividing into cases.
    • MapReduce diverges from most divide and conquer systems in a fairly fundamental way, but one that's so simple that many people almost miss it. The real genius of it is in tagging the intermediate results.
      In a typical (previous) divide and conquer system, you divide the work up serially, execute work packets in parallel, and then merge the results from that work serially again.
      In MapReduce, you divide the work up serially, execute work packets in parallel, and tag the results to indicate which results go with which other results. The merging is then serial for all the results with the same tag, but can be executed in parallel for results that have different tags.

    • MapReduce is not divide and conquer. It does not involve the repeated application of an algorithm to a smaller subset of the previous input. It's a pipeline (a function specified as a composition of simpler functions) where pipeline stages are alternating map and reduce operations. Different stages can perform different operations.
      MapReduce does not break new ground in the theory of computation -- it does not show a new way of decomposing a problem into simpler operations. It does show that particular simpler operations are practical for a particular class of problem.
      The MapReduce paper's contribution was:

      • Evaluating a pipeline of two well-understood orthogonal operators that can be distributed efficiently and fault-tolerantly on a particular problem: creating a text index of a large corpus.
      • Benchmarking map-reduce on that problem to show how much data is transferred between nodes and how latency differences in stages affect overall latency.
      • Showing how to make the system fault tolerant so machine failures during computation can be compensated for automatically.
      • Identifying specific useful implementation choices and optimizations.

      Some of the critiques fall into these classes:

      • "Map/reduce does not break new ground in theory of computation." True. The original paper's contribution was that these well-understood operators with a specific set of optimizations had been successfully used to solve real problems more easily and fault-tolerantly than one-off solutions.
      • "This distributed computation doesn't easily decompose into map & reduce operations". Fair enough, but many do.
      • "A pipeline of n map/reduce stages require latency proportional to the number of reduce steps of the pipeline before any results are produced." Probably true. The reduce operator does have to receive all its input before it can produce a complete output.
      • "Map/reduce is overkill for this use-case." Maybe. When engineers find a shiny new hammer, they tend to go looking for anything that looks like a nail. That doesn't mean that the hammer isn't a well-made tool for a certain niche.
      • "Map/reduce is a poor replacement for a relational DB." True. If a relational DB scales to your data-set then wonderful for you -- you have options.


    Real-time processing

    Before we start, let's try to find the answer to the question: what does "real" mean in the term "real-time"?


    Real-time faces [W11]

    Sub-second real time
    This type of real-time is typical for engineers -- when they say real-time, they are usually referring to sub-second response time, very often in the context of embedded systems. In this kind of real-time data processing, extreme levels of performance are key to success -- nanoseconds count. If this is what we mean when we say real-time data processing, then we need the data to come in, the condition for response to be evaluated, and the response to happen, all generally in less than a second. And if someone else’s system can do it a few nanoseconds faster, we might lose out. In this kind of real-time, pushing the limits of performance isn’t a bonus; it’s a necessity.

    We think about this type of real-time when we say

    • The sensor data is monitored in real-time to catch problems early.
    • This stock exchange application has to bid in real-time or we’ll lose money.

    Human-comfortable real-time response
    This type of real-time is typical for ordinary users -- when they say real-time, they are usually referring to a response time that does not bore or frustrate the user. The performance requirement for this kind of processing is usually a couple of seconds -- performance matters, but it may not be the number one criterion. In some cases, a difference of a single second can be critical. For instance, if a person clicks on an ad on a web page, and the page takes 4 seconds to load, the user is likely to get bored and go look at a different web page. If that same page had loaded in 3 seconds, that user might have bought something on it. For the most part, however, as long as the application responds before the user decides to go surf somewhere else, or check email or something, then the performance requirement is met.

    We think about this type of real-time when we say

    • This website needs to respond to user requests in real-time or we’ll lose sales.
    • We need real-time visualizations for our business intelligence team, no matter how big the data.

    Take an action as a response to some event
    This type of real-time is typical for reacting to changes in the data or user actions -- when we say real-time, we are usually referring to a response time that is the opposite of scheduled. If something happens (an event occurs), do not wait but act -- the so-called event-driven processing model. The performance requirement here is generally "before another event happens". Instead of happening at a particular time interval, event-driven data processing happens when a certain action or condition triggers it. We don’t know precisely when we will need data processing done, but as soon as a certain thing happens, that’s when the need for data processing is triggered.

    There are actually two different performance requirements for event-driven data processing. First, the data processing system has to be finished working and ready to start again before the next event happens. So, if on average, the events happen no closer together than five minutes, a data processing time frame of 2-3 minutes is excellent. If the events tend to happen an average of 10 seconds apart, then clearly, a 2-3 minute processing time would be unacceptable.

    The second performance requirement is more arbitrary: it’s the business SLA (Service-Level Agreement). If, for example, we want to be able to assure that user dashboards have the most current data, up to the minute, then the data processing has to complete within a minute of any data change in order to meet that deadline.

    We think about this type of real-time when we say

    • As changes are made to the database, the replication process copies them out to the cluster in real-time.

    Process data as it flows
    This type of real-time is typical for processing the data as it flows in, one unit at a time -- when we say real-time, we are usually referring to a response time that is the opposite of batch. Because the data is processed as it flows in, and because once the data starts coming in it generally doesn’t end, we speak about streaming data processing. The performance requirement for streaming data processing is that we must process data as fast as it flows in.

    More and more, when people say real-time data processing these days, they are most likely referring to this type of data processing. Streaming data processing has some very specific, and sometimes tricky to implement, requirements. We have to be able to process the data continuously, without start-up or clean-up overhead. Streaming data processing also requires a way to deal with occasional system failures without massive data loss. In some cases, data loss is acceptable, but in others, it isn’t.

    We think about this type of real-time when we say

    • The server information in this data center is monitored in real-time to catch problems early.

    Summary
    So we have four different meanings of the term real-time, and all of them are correct. This ambiguity leads some people to use the term "near real-time" in the sense of any one of the above definitions, aside from sub-second, to emphasize the fact that something must happen "immediately" but we can wait a "moment" for results. Of course, both "immediately" and "moment" are very fuzzy, imprecise and highly dependent on the specific case.

    A very neat summary statement is: Respond before you lose the customer. This, in some ways, is the best possible way to think about real-time when designing any data processing system. Regardless of what level of performance our system has in any given situation, if we end up losing the customer, then it’s simply too slow. This may be a reason to move up to real-time data processing.


    Streaming data processing

    As stated in the previous section, these days when people say real-time in the context of data processing, they are most likely referring to streaming data processing. Although, as we have seen, the various terms differ in meaning, the tools (frameworks) have now converged under the term stream processing.


    Why is stream processing needed?

    • It is needed to provide the expected value in a strictly specified time window. Big data established the value of insights derived from processing data. Such insights are not all created equal: some insights are more valuable shortly after the event has happened, and that value diminishes very fast with time. Stream processing enables such scenarios, providing insights faster, often within milliseconds to seconds from the trigger.
    • Stream processing naturally fits time series data and detecting patterns over time. Some data naturally comes as a never-ending stream of events. To do batch processing, we need to store it, stop data collection at some time and process the data. Then we have to do the next batch and then worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. We can detect patterns, inspect results, look at multiple levels of focus, and also easily look at data from multiple streams simultaneously.
    • Stream processing lets us handle large data and retain only the useful pieces. Sometimes data is huge and it is not even possible to store it. Working with streams, we can examine a small portion of the data and keep only what is important to us. We can get rid of data that does not carry important information or aggregate it into a more compact form. For example, if we observe that a temperature sensor has returned the same value for the last hour, it is enough to store this value only once, along with information about its validity -- the timestamp when this value was first recorded and when it was recorded last (see the sketch after this list).
    • Stream processing lets us handle large data with a reasonable hardware/software stack. With streams, we process data as it comes in and hence spread the processing over time. As a result, stream processing can work with a lot less hardware than batch processing (at any moment we have less data to process). With the almost instant flow, systems do not require large amounts of data to be stored.
    • Stream processing enables approximate query processing via systematic load shedding. Any data aggregation or compaction may entail approximation. If this is what we want/need/accept and approximate answers are sufficient for us, streams may be a good choice.
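
    Below is a minimal sketch of the sensor-value compaction mentioned in the list above (run-length style aggregation; the record layout is an assumption made for the example):

    ```python
    def compact(readings):
        # readings: iterable of (timestamp, value) pairs in time order.
        # Store each run of identical values once, with its first and
        # last timestamp, instead of storing every single reading.
        runs = []
        for ts, value in readings:
            if runs and runs[-1]["value"] == value:
                runs[-1]["last"] = ts  # extend the current run
            else:
                runs.append({"value": value, "first": ts, "last": ts})
        return runs

    readings = [(0, 21.5), (60, 21.5), (120, 21.5), (180, 22.0)]
    print(compact(readings))
    # [{'value': 21.5, 'first': 0, 'last': 120},
    #  {'value': 22.0, 'first': 180, 'last': 180}]
    ```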

    Stream (real-time) processing is also known as online processing. This type of workload can be characterized as

    • involving small amounts of data,
    • using random reads/writes (from/to persistent storage),
    • and responding with low latency.

    Pros: In real-time mode, data is processed in-memory as it is captured, before being persisted to disk. Response time generally ranges from sub-second to under a minute.
    Cons: It is much harder to implement, set up and maintain compared to batch mode.

    Summary -- batch to real-time comparison

    • Batch processing operates over all or most of the data, while stream processing operates over a rolling window of data or the most recent record. So batch processing handles a large batch of data, while stream processing handles individual records or micro-batches of a few records. Batch processing lets the data build up and tries to process it all at once, while stream processing processes data as it comes in, spreading the processing over time.
    • In terms of performance, the latency of batch processing is in minutes to hours, while the latency of stream processing is in milliseconds to seconds.
    • Stream processing is adequate if processing can be done with a single pass over the data or has temporal locality (processing tends to access recent data).
    • Stream processing is inadequate if processing needs multiple passes through the full data or needs access to data outside the current stream's window (range).
    • Real-time systems are usually reactive, meaning they behave based on the conditions of the environment.
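
    For illustration, a minimal sketch of rolling-window stream processing in Python (the window length and the event source are assumptions for the example):

    ```python
    from collections import deque

    def rolling_average(stream, window_size=5):
        # Process each record as it arrives, keeping only the most
        # recent `window_size` values -- never the full history.
        window = deque(maxlen=window_size)
        for value in stream:
            window.append(value)
            yield sum(window) / len(window)

    # A stand-in for a never-ending stream of measurements.
    events = iter([10, 12, 11, 50, 13, 12])
    for avg in rolling_average(events, window_size=3):
        print(round(avg, 2))
    ```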


    SCV principle

    One of the most important concepts related to Big Data processing is the Speed, Consistency and Volume (SCV) principle. If you have some basic knowledge of NoSQL concepts, you should have heard about the CAP theorem. Whereas the CAP theorem is primarily related to distributed data storage, the SCV principle is related to distributed data processing. Just as the CAP theorem states that distributed data storage can be designed to support only two of the three requirements consistency, high availability and partition tolerance, the SCV principle states that a distributed data processing system can be designed to support only two of the three requirements: speed, consistency and volume.

    • Speed This refers to how quickly the data can be processed once it is generated.
    • Consistency This refers to the accuracy and the precision of the results. In common language both terms are used interchangeably as synonyms, but they mean different things: results are deemed accurate if they are close to the correct value, and precise if they are close to each other. To illustrate the fundamental difference between accuracy and precision, the analogy of a shooting target is instructive.
      • For a rifle with a calibrated sighting scope in the hands of a professional marksman with a steady hand, we get both accuracy and precision.
      • For a professional marksman using a rifle whose sighting scope is not calibrated, we get precision but no accuracy.
      • For an amateur (with a shaky hand) using a calibrated rifle, we get accuracy but no precision.
      • For an amateur shooting an un-calibrated rifle, we get neither accuracy nor precision.
    • Volume This refers to the amount of data that can be processed.

    The SCV principle states that

    • If speed (S) and consistency (C) are required, it is not possible to process high volumes of data (V) because large amounts of data slow down data processing.
    • If consistency (C) and processing of high volumes of data (V) are required, it is not possible to process the data at high speed (S) as achieving high speed data processing requires smaller data volumes.
    • If high volume (V) data processing with high speed (S) is required, the processed results will not be consistent (C) since high-speed processing of large amounts of data involves sampling the data, which may reduce consistency.

    In Big Data environments, making the maximum amount of data available is mandatory for performing in-depth analysis (as with distributed data storage). Assuming that data (V) loss is unacceptable, a real-time data analysis system will generally be either S+V or C+V, depending upon whether speed (S) or consistent results (C) are favored. When working with real-time (Big Data processing) systems, we distinguish between hard real-time and near real-time systems.
    In the case of a hard real-time system a fast response (S) is required, hence consistency (C) will be compromised if high-volume data (V) needs to be processed in memory. This scenario requires the use of sampling or approximation techniques, which in turn generate less accurate results, but with tolerable precision, in a timely manner. In the case of a near real-time system only a reasonably fast response (S) is required, hence consistency (C) can be guaranteed even if high-volume data (V) needs to be processed in memory.