NoSQL – Tutorials

Information presented on this page may be incomplete or outdated. You can find the most up-to-date version in my book NoSQL. Theory and examples

In this part we cover the following topics

NoSQL

NoSQL

What is and is not NoSQL?

For our purposes, we will use the definition of NoSQL that I found in [1] and liked very much

NoSQL is a set of concepts that allows the rapid and efficient processing of data sets with a focus on performance, reliability, and agility.

As we can see, this definition does not discredit any existing database systems, including the well-established RDBMSs. It tells us that NoSQL solutions focus primarily on

fast processing,
efficient processing,
reliable processing,
agile processing.

So what is NoSQL?

It’s much more than rows in tables. NoSQL systems store and retrieve data from many formats: key-value, graph, column-family, document, and of course rows in tables.
It’s free of joins. NoSQL systems allow us to extract our data using simple interfaces without joins.
It’s schema-free. In most cases we don't have to create an entity-relational model.
It works on multiple processors. NoSQL systems can run on separate low-cost computer systems, so there is no need for expensive nodes to get high-speed performance.
It supports linear scalability. Every time we add more processors, we get a consistent increase in performance.
NoSQL is a response to today's data-related business factors:
- volume and velocity, referring to the ability to handle large datasets that arrive quickly;
- variability, referring to how diverse data types don’t fit into structured tables;
- and agility, referring to how fast an organization responds to business changes.
NoSQL is Not only SQL.

OK, so what is not NoSQL?

It’s not against the SQL language. SQL as well as other query languages are used with NoSQL databases.
It’s not only about big data. Many NoSQL applications were (and still are) driven by the inability of an application to efficiently scale when big data is an issue. Though volume and velocity are important, NoSQL also focuses on other factors like variability and agility.
It’s not about cloud computing. Many NoSQL systems, like many contemporary applications, reside in the cloud to take advantage of its ability to rapidly scale when the situation dictates. NoSQL systems are not unique in this field, and they can run in the cloud as well as in the corporate data center.
It’s not a closed group of companies, software, or products. Anybody can become a big player in this market by offering innovative solutions to business problems.

So NoSQL is not one specific, well-defined technology. It is rather the result of a broad search for solutions to existing business problems, without stubbornly sticking to established approaches. Anything that can be used to solve existing and, to tell the truth, increasingly sophisticated business problems is worth attention.

Motivation

We know what NoSQL is (and is not). The next questions are: Why is NoSQL the way it is? What factors led designers to shape it that way? Below are some requirements that existing databases could not fulfill.

Flexibility. One drawback of working with a relational database is that we have to know many things (or all of them) in advance, before we start using it. We are expected to know all the tables and columns that will be needed to support an application. It is also commonly assumed that most of the columns in a table will be needed by most of the rows. In practice, we sometimes have no idea how many columns we will need or of what types, or we add a column to our database just to support an option that occurs once in a million cases.
Moreover, our choices often change as we develop our business or IT infrastructure and identify important factors or redefine those already known. This may be called an agile approach to data: the flexibility to collect, store, and process any type of data whenever it appears.

Flexibility also allows us to reduce costs, as highlighted in one of the following items.
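
The flexibility described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory store (the names `store`, `put` and `find` are made up for this example and do not belong to any real NoSQL product) in which every record is a free-form document, so a rarely used attribute can be added to a single record without altering anything else.

```python
# Hypothetical in-memory "document store": a dict mapping keys to documents.
# Each document is a plain dict, so two records need not share the same fields.
store = {}


def put(key, document):
    """Store a document under a key without any predefined schema."""
    store[key] = dict(document)


def find(field, value):
    """Return the sorted keys of all documents whose `field` equals `value`."""
    return sorted(k for k, doc in store.items() if doc.get(field) == value)


put("user:1", {"name": "Alice", "email": "alice@example.com"})
# A rare attribute is added to one record only; no table has to be altered.
put("user:2", {"name": "Bob", "email": "bob@example.com", "fax": "555-0100"})

print(find("name", "Bob"))
print(find("fax", "555-0100"))
```

In a relational table the `fax` column would have to exist for every row, even though in this sketch only one record uses it.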

Availability. Today's world is full of information, in the sense that the information we are looking for can be found in many different sources. We know this, so if one source is unavailable, we will not wait for it but will most likely switch to another one. Think about how you use a web browser to find a webpage with the information you need. In just a second you have thousands of results. Consequently, if you click the first link and wait more than 3 seconds for the page to appear, you become irritated and click the second link, hoping to get a result in less than one second. If you don't believe it, read the notes below.

Users' requirements regarding database availability are becoming more and more stringent. First, they expect that in the event of a system failure they will lose no data, or only a strictly defined amount of it. This requirement can be met by implementing an appropriate backup and restore strategy, that is, solutions of the DR (Disaster Recovery) class. However, fixing the failure and restoring the database can take a long time. Therefore, meeting the users' second expectation, namely ensuring continuous availability of database systems even when their individual components are damaged or the load rises suddenly, requires the use of an appropriate High Availability (HA) technology.

Guaranteeing the continuous operation of a database system is one of the most important tasks of its administrator. It consists in preventing service unavailability resulting from hardware or software failures, user errors, natural disasters, etc. by protecting resources (such as disks or the database) with an appropriate HA technology.

Service unavailability may be planned in advance, for example caused by the need to carry out upgrade work, install patches, or replace hardware, or it may be unplanned, resulting from an error that prevents users from working in a way they can notice. Both kinds of unavailability should be covered by the SLA (Service Level Agreement) agreed with the system's user.

In the simplest case, the SLA should specify the permissible service downtime, calculated according to the formula

	                    MTBF
	Availability = ---------------
	                (MTBF + MTTR)

where

MTBF (Mean Time Between Failures) is the average time between failures,
MTTR (Mean Time To Repair) is the average time needed to fix a failure.

Availability calculated this way is usually expressed as a number of nines

Availability	Downtime per month	Downtime per year
95%	36 hours	18 days
99%	7 hours	3.5 days
99.5%	3.5 hours	<2 days
99.9%	43 min 12 s	8 h 45 min
99.99%	4 min 19 s	52 min 36 s
99.999%	25 s	5 min 15 s

[...] we should note that the continuous availability of databases also depends on proper server configuration and appropriate use of the server's features.

For example, changing some SQL Server configuration options, such as enabling the execution of external scripts, requires a service restart. Therefore, the first step is to properly configure the server whose continuous availability we are to ensure.

[A7] Marcin Szeliga, Wysoka dostępność serwerów SQL. Klastry niezawodności infrastruktury, IT professional, 11/2018, p. 28
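
As a worked example of the availability formula quoted from [A7], the sketch below computes availability from MTBF and MTTR and converts it into downtime. The MTBF and MTTR values are made-up illustrative numbers, not measurements of any real system, and the function names are chosen for this example only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours(avail, period_hours):
    """Expected downtime, in hours, over a period of the given length."""
    return (1.0 - avail) * period_hours


# Assumed values: on average one failure every 999 hours, repaired in 1 hour.
a = availability(mtbf_hours=999, mttr_hours=1)
yearly = downtime_hours(a, HOURS_PER_YEAR)

print(f"availability: {a:.3%}")
print(f"downtime per year: {yearly:.2f} hours")
```

With these assumed inputs the availability is "three nines", and the yearly downtime comes out at about 8.76 hours, in line with the 99.9% row of the table.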

Amazon S3 is designed for 99.999999999% (11 9's) of durability, and stores data for millions of applications for companies all around the world. [W18]

One example of deploying a disaster recovery service [...] to ensure business continuity in the event of a failure is Onet, one of the largest web portals. In the event of a failure [...] Onet can bring up the core elements of its applications within 15 minutes, while the entire portal will be restored within 4 hours.

USD 8,000: that is how much an hour of downtime caused by a non-functioning IT system can cost a small company. For medium-sized enterprises these costs can reach USD 74,000, and large companies can lose as much as USD 700,000.

Every dollar spent on preventing and mitigating threats saves four dollars in the cost of restoring operational capability after a failure.

[A8] Jerzy Michalczyk, Gdy trwoga to do... chmury. Disaster recovery as a Service, IT professional, 9/2018, p. 42-42

A recent survey on download decisions claimed that the description of the app was the second most important reason for install, with rating coming in as number one. The problem is, what people say and what they actually do are often two very different things, so I set out to find the truth about the reasons why users download the apps that they do — and discover why some apps stay ignored. [W20]

Screenshots or Thumbnails
Less than 4% of people coming to your app page tap on your screenshots. [...] fewer than 4% of users looking for an app enlarge portrait screenshots, and only 2% enlarged landscape screenshots. For gamers, it’s even less at just 0.5%. This is probably because the gameplay is usually clear enough even from thumbnails.

Let’s look at some screenshots redesign from 2016 to 2018. Notice how almost every one of them have fewer words and bigger fonts.

Application	2016	2018
airbnb	words: 6, font size: 43	words: 4, font size: 91
tinder	words: 6, font size: 50	words: 3, font size: 112
Spotify	words: 4, font size: 45	words: 3, font size: 100
Pinterest	words: 10, font size: 40	words: 3, font size: 90

More is Better?
78 of the top 100 apps have five screenshots, 13 apps have four screenshots, 6 apps have three screenshots and 3 only have two. As a developer, you would think to go for five screenshots because more content is better, right? Wrong.

Only 9% of users scroll past the first two screenshots. Landscape screenshots perform worse at 5%.
Highlighted UI Elements
Users glancing at your screenshots are trying to gauge the functionalities of your app. Text captions help them understand the context behind the screens. Designers are making it even easier for users by highlighting UI elements that the text caption is trying to explain.

Compare the resulting old and new App Store screenshots.

[W19] UX Best Practices: How to Design Scannable App Screenshots, access: 2018-11-06
[W20] Why 7 seconds could make or break your mobile app, access: 2018-11-06

Speed. From the customer's point of view, availability seems to be the most important factor. A relational architecture, with its many constraints on data (tables), additional rules (e.g. database triggers), and the need to split every object into the smallest pieces (normal forms), does not support fast data saving. It would be much more useful to be able to store anything we want immediately, at the time we want, deferring all rule checking to some later point in time.
Moreover, relational complexity is very often not needed. We don't need sophisticated logic to handle all the articles of a blog. This logic can easily be implemented in application code, in favour of data availability.
Cost. One option is to use one expensive server and a second one as its backup. At the beginning, most of its resources will be unused. The load will increase over time and reach the server's limit in, say, five years, but we have to pay for that capacity now, which is a waste of money. The other option is to buy low-cost machine(s) sufficient for current needs. When the load increases a little, we can add one more node to our cluster. We can take such small steps every time the load goes up. This way we save money, which is especially important when we are starting our business.
Scalability. Scaling is the only way we can meet the above requirements. It is important that it be as easy, and come with as little additional cost, as possible.
Simply speaking, scalability is the ability to efficiently meet the needs of varying workloads. We distinguish two types of scaling
- Scaling up. We do this by upgrading an existing database server with additional processors, memory, or any other resources that improve the performance of the database management system. Replacing an existing server with a more powerful one has a similar effect.
- Scaling out. We do this by adding (not upgrading or replacing) servers as needed. This type of scaling is more flexible than the previous one.

Depending on the needs of a particular application, some of these factors may be more important than others, which explains why the NoSQL family is so big.

BASE

Relational database systems have their ACID set of rules. By chemical analogy, NoSQL systems have their BASE set of rules. While ACID systems focus on high data integrity, NoSQL systems take into consideration a slightly different set of constraints named BASE.

ACID systems focus on the consistency and integrity of data above all other considerations. Temporary blocking is a reasonable price to pay to ensure that our system returns reliable and accurate information. ACID systems are said to be pessimistic in that they must consider all possible failure modes in a computing environment. According to Murphy’s Law, if anything can go wrong it will go wrong; ACID systems are ready for this and guarantee that they will survive.

In contrast, BASE systems focus on something significantly different: availability. The most important objective of a BASE system is to allow new data to be stored, even at the risk of being out of sync for a short period of time. BASE systems aren’t considered pessimistic in that they don’t worry about the details if one process is behind. They’re optimistic in that they assume that eventually, in the not-so-distant future, all systems will catch up and become consistent.

BASE is the alternative to ACID. It stands for these concepts

Basic availability means that the database appears to work most of the time. It allows systems to be temporarily inconsistent so that transactions are manageable. In BASE systems, the information and service capability are basically available. This means that there can be a partial failure in some parts of the distributed system but the rest of the system continues to function.
Soft-state means that stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time. Some inaccuracy is temporarily allowed and data may change while being used. State of the system may change over time, even without input. This is because of eventual consistency.
Eventual consistency means that there may be times when the database is in an inconsistent state. Eventually, when all service logic is executed, the system is left in a consistent state.
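
The three BASE concepts can be sketched with a toy model. It assumes a hypothetical store (the class and method names are invented for this illustration) that accepts a write on one replica immediately and propagates it to the others only when `sync` is called, so replicas may disagree for a while and then converge.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on replica 0 first and reach the rest later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        # Basic availability: the write is accepted at once on one replica.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, replica_index, key):
        # Soft state: the answer depends on which replica we happen to ask.
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Eventual consistency: background propagation brings replicas in line.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


db = EventuallyConsistentStore(replica_count=2)
db.write("title", "NoSQL")
print(db.read(0, "title"), db.read(1, "title"))  # replicas disagree for now
db.sync()
print(db.read(0, "title"), db.read(1, "title"))  # replicas agree after sync
```

Real systems propagate updates continuously and must also resolve conflicting writes; this sketch leaves both aspects out.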

CAP theorem

The CAP theorem is about how distributed database systems behave in the face of network instability.

When working with distributed systems over unreliable networks we need to consider the properties of consistency and availability in order to make the best decision about what to do when systems fail. The CAP theorem introduced by Eric Brewer in 2000 states that any distributed database system can have at most two of the following three desirable properties

Consistency. Consistency is about having a single, up-to-date, readable version of our data available to all clients. Our data should be consistent: no matter how many clients are reading the same items from replicated and distributed partitions, they should get consistent results. All writes are atomic and all subsequent requests retrieve the new value.
High availability. This property states that the distributed database will always allow database clients to perform operations such as select or update on items without delay. Internal communication failures between replicated data shouldn’t prevent operations on it. The database will always return a value as long as a single server is running.
Partition tolerance. This is the ability of the system to keep responding to client requests even if there’s a communication failure between database partitions. The system will still function even if network communication between partitions is temporarily lost.

Note that the CAP theorem only applies in cases when there’s a connection failure between partitions in our cluster. The more reliable our network, the lower the probability we will need to think about this theorem. The CAP theorem helps us understand that once we partition our data, we must determine which options best match our business requirements: consistency or availability. Remember: at most two of the aforementioned three desirable properties can be fulfilled, so we have to select either consistency or availability.
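
The choice between consistency and availability during a partition can be sketched as follows. This is a simplified two-node model with hypothetical names (`Cluster`, `mode`, `partitioned`), not the behaviour of any particular database: in "CP" mode a write is refused while the nodes cannot talk to each other, and in "AP" mode it is accepted locally and the nodes diverge.

```python
class Cluster:
    """Two replicas that either refuse writes (CP) or diverge (AP) when partitioned."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [None, None]  # the value held by node 0 and node 1
        self.partitioned = False    # True while the nodes cannot communicate

    def write(self, node_index, value):
        """Try to write through the given node; return True if accepted."""
        if not self.partitioned:
            self.values = [value, value]  # healthy network: both nodes updated
            return True
        if self.mode == "CP":
            return False                  # give up availability, stay consistent
        self.values[node_index] = value   # give up consistency, stay available
        return True


for mode in ("CP", "AP"):
    cluster = Cluster(mode)
    cluster.write(0, "v1")
    cluster.partitioned = True
    accepted = cluster.write(0, "v2")
    print(mode, accepted, cluster.values)
```

When the network is healthy both modes behave identically, which matches the remark that the theorem only matters when a connection failure occurs.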

Consistency

The CAP theorem is important when considering a distributed database, since we must decide what we are willing to give up. The database we choose will lose either availability or consistency. Partition tolerance is strictly an architectural decision: we have to decide whether the database will be distributed or not. Distributed databases, by definition, must be partition tolerant, so the choice between availability and consistency must be made. This can be difficult. However, while CAP dictates that if we choose availability we cannot have true consistency, we can still have something known as eventual consistency. One of the best-known systems of this type is the Domain Name System (DNS). When a new domain is registered, it may take a few days to propagate information about it to all DNS servers across the Internet. But all the time all DNS servers are available and have almost accurate information. So, the idea behind eventual consistency is that each node is always available to serve requests with the data it currently has. As a trade-off, data modifications are propagated in the background to other nodes, and this process may take some time. This means that at any given time the system may be inconsistent, but the data is still almost accurate.

Reading about NoSQL databases, we may come across the concept of a quorum. A quorum is the minimal number of nodes that must respond to a read or write operation for it to be considered complete. Of course, having the maximum quorum and querying all servers is a way to determine the correct result. In most cases such a big quorum is not what we want, because we pay for it with a longer response time. We can vary the threshold to improve either response time or consistency. If the read threshold is set to 1, we get a fast response. The lower the threshold, the faster the response but the higher the risk of returning inconsistent data. Our goal is to have a quorum as small as possible while still guaranteeing true consistency.
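
The trade-off between the read threshold and consistency can be sketched with a toy model. It relies on a commonly used rule that is not stated in the text above: with N replicas, a read threshold R and a write threshold W, every read overlaps the latest write when R + W > N. The class below is hypothetical and deliberately picks the worst case, writing to the first W replicas and reading from the last R.

```python
def quorum_overlaps(n, r, w):
    """True when any r replicas must share at least one replica with any w replicas."""
    return r + w > n


class QuorumStore:
    """Toy model of one replicated value; each replica holds (version, value)."""

    def __init__(self, n):
        self.replicas = [(0, None)] * n

    def write(self, value, w):
        version = max(v for v, _ in self.replicas) + 1
        for index in range(w):  # worst case: only the first w replicas are updated
            self.replicas[index] = (version, value)

    def read(self, r):
        answers = self.replicas[-r:]  # worst case: ask only the last r replicas
        return max(answers, key=lambda answer: answer[0])[1]  # newest version wins


store = QuorumStore(n=3)
store.write("a", w=2)
print(quorum_overlaps(3, 1, 2), store.read(r=1))  # no guaranteed overlap: may be stale
print(quorum_overlaps(3, 2, 2), store.read(r=2))  # overlap guaranteed: latest value
```

With three replicas and W = 2, raising the read threshold from 1 to 2 is enough for the read to see the newest value, at the cost of waiting for one more node.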

Following the quorum idea we can talk about durability which is the property of maintaining a correct copy of data for long periods of time. As we can adjust a read threshold to balance response time and consistency, we can also adjust a write threshold to balance response time and durability. With this threshold a write operation is considered complete when a minimum number of copies have been written to persistent storage.

Going back to eventual consistency, we can distinguish several types of it

Causal consistency. Causal consistency means that the database reflects the order in which operations were performed, so operations that depend on one another are seen in that order.
Read-your-writes consistency. Read-your-writes consistency means that once we have updated a record, all of our subsequent reads of that record will return the updated value.
Session consistency. Session consistency means read-your-writes consistency at the session level. A session can be thought of as a conversation between a client and a server. As long as the conversation continues, we will read everything we have written during it. If the session ends and we start another session with the same server, there is no guarantee that we can read the values we wrote during the previous conversation.
Monotonic read consistency. Monotonic read consistency means that whenever we make a query and see a result, we will never see an earlier version of the value.
Monotonic write consistency. Monotonic write consistency means that when we issue several update commands, they are executed in the order we issued them.
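
One possible way to obtain read-your-writes and monotonic reads within a session is sketched below. It assumes a hypothetical client-side `Session` object that remembers the highest version it has written or seen for each key and skips replicas that are behind; real databases implement these guarantees in various ways, and the versioning here is simplified to a per-session counter.

```python
class Session:
    """Client session that never reads a version older than one it already knows."""

    def __init__(self, replicas):
        self.replicas = replicas  # list of dicts: key -> (version, value)
        self.seen = {}            # key -> highest version known to this session

    def write(self, replica_index, key, value):
        version = self.seen.get(key, 0) + 1
        self.replicas[replica_index][key] = (version, value)
        self.seen[key] = version

    def read(self, key):
        for replica in self.replicas:
            version, value = replica.get(key, (0, None))
            if version >= self.seen.get(key, 0):  # skip replicas that lag behind
                self.seen[key] = version
                return value
        raise RuntimeError("no replica is fresh enough for this session")


replicas = [{}, {}]  # replica 0 has not received the update yet
session = Session(replicas)
session.write(1, "x", "new")
print(session.read("x"))             # the session sees its own write
print(Session(replicas).read("x"))   # a new session may still get stale data
```

The second read illustrates the limit of session consistency: once a new session starts, nothing guarantees that earlier writes are visible.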

NoSQL pros and cons

NoSQL is undoubtedly characterized by the following set of positive features

Relaxing ACID rules in favor of BASE is, in many cases, a price we can pay.
Being schema-less increases processing flexibility.
It’s easy to store a high volume of high-variability data arriving with high velocity.
In many cases a modular architecture allows components to be exchanged.
Linear scaling is possible as new processing nodes are added to the cluster.
Operational costs may be low(er).

To be honest, one cannot forget about the negative side of NoSQL

ACID transactions are often not supported, which demands a different way of thinking compared to the relational model.
Lack of built-in security mechanisms like views or roles.

Why data stores?

NoSQL databases are very often referred to as data stores rather than databases. We do this because they lack features that we may expect based on our experience with traditional relational databases, such as typed columns, triggers, or query languages.