Big Data paradigm – Tutorials

Informations presented on this page may be incomplete or outdated. You can find the most up-to-date version reading my book Engineering of Big Data Processing

In this part we cover the following topics

Paradigm
Factors we have to work with
A story
Desired properties of a Big Data system

Paradigm

The Oxford English Dictionary defines the paradigm term as "a typical example or pattern of something; a pattern or model".[Oxford, paradigm]. The term was introduce to contemporary vacabulary by the historian of science Thomas Kuhn when he adopted the word to refer to the set of concepts and practices that define a scientific discipline at any particular period of time. In his book, The Structure of Scientific Revolutions published in 1962, he defines a scientific paradigm as: "universally recognized scientific achievements that, for a time, provide model problems and solutions for a community of practitioners" [B4]

Word paradigm comes from Greek paradeigma, "pattern, example, sample" from the verb paradeiknumi, "exhibit, represent, expose" and that from para, "beside, beyond" and deiknumi, "to show, to point out".

In contemporary, common, science and philosophy meaning, a paradigm is a distinct set of concepts or thought patterns, including theories, research methods, postulates, and standards for what constitutes legitimate contributions to a field.

A paradigm from the dogma is distinguished by several essential features:

It is not given once and for all. It is adopted on the basis of the consensus of most researchers, which does not imply that the scientists vote on accepting or rejecting the paradigm. What counts is the paradigm's correspondence to existing knowledge and the fulfillment of many conditions (in terms of, for example, existing proofs).
It may periodically undergo fundamental changes leading to profound changes in science what is called the scientific revolution.
There is no something such absolute correctness.

A good paradigm has several essential features:

It is consistent logically and conceptually.
It is as simple as possible and contain only those concepts and theories that are really necessary for a given science.
It give the possibility of creating detailed theories in accordance with known facts.

The term dogma is transliterated in the 17th century from Latin dogma meaning "philosophical tenet", derived from the Greek dogma meaning literally "that which one thinks is true" and Greek dokeo "to seem good".

It is used in pejorative and non-pejorative senses.

In the non-pejorative sense, dogma is an official system of principles or tenets of a church, or the positions of a philosopher or of a philosophical school.
In the pejorative sense, dogma refers to enforced decisions, such as those of aggressive political interests or authorities. More generally it is applied to some strong belief that the ones adhering to it are not willing to rationally discuss. It is often used to refer to matters related to religion, but is not limited to theistic attitudes alone, and is often used with respect to political or philosophical ideas.

In this chapter, we intentionally talk about paradigms not about best practices. Why not best practices? -- you ask. Best practices? Because according to Toyota, and I agree with that, there is no such thing as best practice. When you say best practices, you behave like you have just found a Holy Grail and, being satisfied, you will never search again for any other improvements. You stuck in what you have. Although the key principles of Toyota’s Production System can be conveniently summarised, it is vital to remember it represents a way of thinking, not just a set of tools and techniques. For those of you looking for a quick fix you may as well stop reading now. There isn’t one. It’s taken Toyota 60 years to create what they have and they still haven’t finished!

The Toyota Production System

Adopt A Long Term Philosophy: It’s Not Just About Making A Profit
- Until most recently, Toyota was different from most companies. Decisions are not driven entirely by satisfying short term profit and loss forecasts. They actually think beyond the desire to satisfy the accountants. Toyota plan and act with the long term in mind, not the quarterly results.
- Toyota’s mission and guiding principles is not a poster on the wall, they’re practiced each and every day.
Investment In People: "We Build People Before We Build Cars"
- Every company says its products are only as good as its people. Toyota actually means it. They invest considerable effort in recruiting and keeping the very best. The recruitment of an assembly line worker can take up to 14 months before they are offered a full time contract. When asked: How can you afford this kind of recruitment process, Toyota’s response is swift How can YOU afford not to?
- Toyota reward team performance, not that of the individual. Promotion can take time. Slow promotion and rewards for team work is the norm.
- Toyota sees ongoing education as vital. They have even built their own University to ensure a steady flow of high quality engineers! They regard education as an investment, not a cost.
- Managers are seen as mentors and coaches, teachers not dictators.
- Toyota generally promote from within. They grow leaders rather than purchasing them.
Focus On The Production Line: Keep The Main Thing The Main Thing
- Everyone focuses on servicing the production line. Every other activity is seen as non value added.
- An upside down management structure. Team leaders and managers support those at the sharp end, not the other way around. Assembly line operators are at the top of the pyramid, not the bottom.
- There is a genuine belief that only assembly operators and engineers add value. Everyone else must justify their reason for being!
- Everyone must serve their apprenticeship and develop deep understanding of the process. Even those in HR have "time served" on the assembly line.
- Managers and team leaders spend up to 80% of their time on the production line solving problems and adding value -– not in meetings or "emailing".
- Leaders manage from where the work is done, not from their office.
A Genuine Learning Organisation: "Solving Problems Is Key To Our Success"
- Toyota intentionally runs very, very lean. There are no buffer stocks to fall back on. Running lean means that problems can have a dramatic impact on production output. Problems have to be fixed, and fixed quickly.
- The pervading attitude is that problems and process deviations are really good news, providing you learn from them. It’s accepted that "real time" solving problems at source saves time and money later on downstream.
- Toyota famously adopts the "andon" approach to problem solving. Deviations are acted upon within minutes and always solved at the site of the incident, never from behind a desk. Deviations are brought to the surface quickly and solved within hours.
- Suggestion schemes are used to generate new ideas and ways of working with over 90% of suggestions implemented. Payments are weighted towards the small incremental improvements not the big ones.
Only Focus On The Manufacturing Process And Nothing Else
- They rarely use any new or unproven technology.
- Technology is only used when it can add value and keep things simple.
- Toyota’s attitude is that people do the work, computers only move the information. If technology distracts or confuses the user, it is simply not used.
Standardisation Is The Name Of The Game
- Standardisation is about finding out the scientifically best way of doing a task, proving that it works and then "freezing" it. Although people are expected to "follow the rules" SOPs (SOP -- Standard Operating Procedure) are not allowed to stifle innovation and further improvement. Users are encouraged to share best practice, "hints and tips" and improve SOPs further. SOPs are constantly reviewed and improved, not every 2-3 years!
- The level of procedural compliance is very high for one simple reason –- user involvement. SOPs are developed from the bottom up not from the top down. Management have very little input. Users are seen as the document owner. They write, design, refine and implement all new SOPs. Not surprisingly compliance is not a problem because they usually work!
The War On Waste: It Never Stops
- Anything that adds no value is removed from the system.
Performance Measures: Less Is More!
- Less is more. Toyota measures only what is important and avoids the "death by measure".
- They only select measures that drive the right behaviour. For example, assembly line workers are rewarded for raising deviations. After all, you learn more from your mistakes than from your successes. Contrast this approach with that taken by many [...] companies, namely to encourage people to reduce deviation numbers. This measure drives completely the wrong behaviour!

QUALITY MANAGEMENT BEST PRACTICES MANAGING QUALITY. THE ‘TOYOTA WAY’!, https://www.hpcimedia.com/images/PDF/NSF.pdf, retrieved: 2019-03-14

Knowing what a paradigm is, we can try to find some related with Big Data. We will do this throught the whole set of lectures but in this lecture we will try to give the foundations, most common rules and ideas.

Factors we have to work with

There have been a number of trends in technology that deeply influence the ways in which we can/should think about/build Big Data systems.

CPU limits

The most profound change is related to CPU which seems to hit the physical limits of how fast a single CPU can be.

All we know the Moore's Law -- an empirical law, resulting from the observation that the economically optimal number of transistors in an integrated circuit increases in subsequent years in accordance with the exponential trend (it doubles in almost equal lengths of time). The authorship of this law is attributed to Gordon Moore, one of the founders of Intel, who in 1965 described a doubling every year in the number of components per integrated circuit, and projected this rate of growth would continue for at least another decade. In 1975, looking forward to the next decade, he revised the forecast to doubling every two years. The period is often quoted as 18 months because of Intel executive David House, who predicted that chip performance would double every 18 months (being a combination of the effect of more transistors and the transistors being faster).

One of the main reasons why this exponential growth is possible is the use of smaller and smaller elements in the fabrication process. Today, 65, 45, 32 and recently 14 nm dominate, when 500 nm technology was used in the early 1990s. Taking into account classical physics, these dimensions can not be reduced indefinitely -- the limit is the size of atoms, and the next limitation is the speed of light in a vacuum that sets the upper limit for the speed of information transmission.

That is why the trend of the entire computer industry is directed towards the creation of multiprocessor (or multi-core) systems and parallel processing (used so far in efficient servers and supercomputers). Also this form of Moore's law extension has its limits in the form of Amdahl law and high delays in access to data. Another big problem of current technologies is high power consumption and heat generated.

In computer architecture, Amdahl's law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.

Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example,

If a program needs 100 hours using a single processor core, and a particular part of the program which takes 50 hour to execute cannot be parallelized, while the remaining 50 hours ($p=0.5$) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical 50 hours. Hence, the theoretical speedup is limited to at most 2 times (1/(1 − p) = 2).
If a program needs 100 hours using a single processor core, and a particular part of the program which takes 5 hour to execute cannot be parallelized, while the remaining 95 hours (p=0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical 5 hours. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − 0.95) = 20).

According to Amdahl's law the theoretical speedup of the execution of the whole task increases with the improvement of the resources of the system and that regardless of the magnitude of the improvement, the theoretical speedup is always limited by the part of the task that cannot benefit from the improvement. Any theoretical speedup is limited by the serial part of the program. For this reason, parallel computing with many processors is useful only for highly parallelizable programs.

Single core speed is an important CPU test for consumers. This test only stresses one processing core at a time so it's a good measure of individual core performance rather than total calculation throughput.

Multi core speed is a server orientated CPU benchmark. This test is more appropriate for measuring server rather than desktop performance because typical desktop workloads rarely exceed two cores.

CPU	Single core	Multi core
Intel Core i7-8700K 3.7 GHz (6 cores)	5943	25889
Intel Core i7-7740X 4 Cores, 8 Threads @4.30 GHz Kaby Lake (best as single core)	141	752
Intel Core i9-7980XE 18 Cores, 36 Threads @2.60 GHz Skylake (best as multi core)	123	2716

So parallelization is the key. Comparing i9 with i7 working with just one thread could be a reason to be ashamed for i9 (i9: 123 vs. i7: 141). Working with many threads makes i9 the real king (i9: 2716 vs. i7: 752). That means, if we want to scale to more data, we must be able to parallelize our computation. Instead of just trying to scale by buying a better machine, known as vertical scaling, systems scale by adding more machines, known as horizontal scaling.

Now that Intel’s latest 9th-gen Core mobile chips are on their way, it’s time to figure out whether it’ll be worth it to pay top dollar for the new chip; buy or keep a laptop with an 8th-gen CPU (plenty of current models remain available), or upgrade from your 6th-gen or 7th-gen model.

The issues differ from generation to generation. In many cases, we’ve said you can wait a few years before upgrading. The 8th-generation jump was an exception: It represented one of the biggest laptop CPU improvements in a long time, worthy of upgrading even from a 7th-gen chip.

With the 9th gen, however, we’re mostly back to the incremental clock speed upgrades we’ve seen from Intel for years. That’s not to say it’s all underwhelming -- because you can get that fancy new 802.11ax/Wi-Fi 6, too. But for those interested in speed boosts, the sparse improvements in core counts mean you can largely ignore this series except for the Core i9 lineup.

Should you buy a laptop with 8th-gen or 9th-gen Core CPU? It's all about cores and clocks, retrieved: 2019-05-07

As we can see in the note above, the only true reason that one can be interested in upgrading system is rather the number of cores than clock speed upgrade ([...] the sparse improvements in core counts mean you can largely ignore this series [...]).

Think parallel

Hardware rental

Everybody heard about cloud technology. We can say that clouds are everywhere: on the sky which is obvious, and in our life regardless we realize about that or not. Many services are cloud-based and this seems to be an irreversible trend. Despite cloud is trendy, it offers something else: great elasticity. Today it doesn't pay to own our own hardware in our own location. Much easier and cheaper is to rent hardware on demand which is known as Infrastructure as a Service (IaaS). With this we can increase or decrease the size of our infrastructure nearly instantaneously with just one-button click, depending on the job we are going to do. So today horizontal scaling, i.e. adding or removing nodes (computers) to/from a system, is not a problem and can be completed on demand within just a few seconds.

Rent or own

Software availability

Nowadays word are driven by free, open sourced and easy to make first step software (get it, install it and run it) and we have a lot of options to choose from. In consequence, we can easily mix different technologies and test them before we will use them in real production system. This give us a great flexibility in terms of tools we use as far as a high adaptability to follow changing market requirements.

Mix technologies

Need to be easy adaptable to follow changing market requirements

In today's world...

Be felxible, be agile

Use what you have

We can say that there are two options: either we can use specialized, carefully tuned one-node system or we can use easy available commodity nodes (components). Whether it pays off, it depends on what is important to us. Let's point out some aspects.

Even specialized, carefully tuned one-node system can not be used forever and extended infinitely. There will come a time when we will have to think not about modifying but about replacing the system.
What is cheaper to buy: one one-node monolithic system or ten commodity nodes?
What is cheaper in everyday running: one one-node monolithic system or ten commodity nodes?
What is easier to maintain: one one-node monolithic system or ten commodity nodes?
What is easier to use: one one-node monolithic system or ten commodity nodes? DB problem - distributed queries.
What is easier to programm: one one-node monolithic system or ten commodity nodes?
What is easier to extend: one one-node monolithic system or ten commodity nodes?

High-tech may not pay off

Summary

Presented trends are something we have to face with. Wheter we agree or not, they are real things and we can not get away from them. They influence the ways in which we can/should think about/build Big Data systems. These trends are

CPU limits;
hardware rental;
software availability;
need to be agile;
commodity nodes usage.

A story

Hitting the limits

A single insert at a time to the database.

Queue to the rescue

It can be more efficient if we batch many inserts in a single request. So we re-architect our back end to make this possible.

Sharding

One of the best approaches to scale a write-heavy relational database is to use multiple database servers and spread the table across all the servers. Each server will have a subset of the data for the table. This is known as horizontal partitioning or sharding. This technique spreads the write load across multiple machines.

The problem with this solution is that all of our application components need to know how to find the correct shard for each key. So we need wrap a library around our database-handling code that reads the number of shards from a configuration file, and we redeploy all of our application code.

Another problem with shareds is that we have to coordinate local results (obtained on every shared) to get one global result. It can be easily done for simple tasks like geting the maximum value. Unfortunatley even simple tasks becomes much harder when we "a little" change requirements: for example get top 10 values instead one maximum.

In both cases, the more shareds we have, the more work to coordinate we have.

Fault-tolerance

With many shards it becomes a not-infrequent occurrence for the disk on one of the database machines to go bad. That portion of the data is unavailable while that machine is down. We do a couple of things to address this:

We update our queue/worker system to put increments for unavailable shards on a separate "pending" queue that we attempt to flush once every five minutes. This sounds nice until queue subsystem also fails.
We use the database’s replication capabilities to add a slave to each shard so we have a backup in case the master goes down. We don’t write to the slave, but at least customers can still view the stats in the application.

Data corruption

Mistakes are not uncommon but they are an inseparable part of our live.

Summary

The main idea of presented story was to point out the most obvious and often found problems when our data becomes big.
These problems are

traditional (relational) databases limits,
insufficient fault-tolerance,
insufficient protection against data corruption.

Desired properties of a Big Data system

In the precedings sections some trends and problems were described. All of them should be taken into account if we want to "define" any patterns or models. Have in mind, that no pattern or model is ethernal and given once for all and should be verified and re-adopted on the basis of the state of the art knowledge, theory and practise.

A Big Data system should:

be scalable;
be distributed (the databases and computation systems we use for Big Data should be aware of their distributed nature);
be fault tolerant (compare with Data model for Big Data) lecture;
be able to support and interoperate with a wide range of applications/systems;
be easy adaptable to follow changing market requirements;
be easy extensible.