Column family (BigTable) stores

In this part we cover the following topics

The origins

The data driven world

We live in a data driven world and there are no premises that this situation will change in the nearest future. Now we are witnesses of something which can be called data revolution. As the industrial revolution has changed the way we think about production from being created essentially by hand in house to being created in assembly lines that are located in big factories so our data sources also has changed. Time ago all data was generated locally "in house". Now, data comes in from everywhere: our actions (visiting web pages, online shops, playing games , etc.), social networks, various sensors , etc. We can say, that situation is even worse: data becomes a perpetuum mobile: any set of data generates twice as much new data which generates twice as much new data which...

More data allows us to use more efficient and sophisticated algorithms. Better algorithms means that we can work better, faster, safer and our general satisfaction with the use of technology is increasing. The more people is satisfied with the technology they use, the more new users will it attract. More users generate more data. More data allows us to...

From Big Data through Big Table to Hadoop

There are people who says: There is nothing more practical than good theory. I don't like this statement and instead say: There is nothing more theoretical than good practice. Hadoop is a good example of this: practical needs have led to the new computational paradigm and one of the best known type of the NoSQL databases: column family.

The story began when PageRank was developed in Google. Without getting into the details of which interested readers can read in a books or internet, one can say that PageRank allowed the relevance of a page to be weighted based on the number of links to that page instead of simply indexed on keywords within webpages. With this we are making implicit assumption that the more popular a page is the more links to that page can be found in the whole network. This seems to be reasonable: if I don't like someone's page I don't have a link to it and conversely, I have links to the pages I like. PageRank is a great example of a data-driven algorithm that leverages the collective intelligence (known also as wisdom of the crowd) and that can adapt intelligently as more data is available which is very close related with another trendy and popular term -- machine learning -- close related with big data which in turn is related with NoSQL.

Predicted large amount of data to be processed, PageRank required the use of non-standard solutions.

• Google decided to extensively use massively parallelizing and distributing processing across very large numbers of low budget servers.
• Instead of dedicated storage servers built and operated by other companies, storage based on disks directly attached to the same servers which perform data processing.
• Redefine unit of computation. No more individual servers. Instead the Google Modular Data Center were used. The Modular Data Center comprises shipping containers that house about a thousand custom-designed Intel based servers running Linux. Each module includes an independent power supply and air conditioning. Thus, data center capacity is increased not by adding new servers individually but by adding new modules with 1,000 servers each.

We have to emphasize here the most significant assumption made by Google: data collecting and processing are based on highly parallel and distributed technology. There was only one problem: such a technology didn't exist at the time.

To solve this problem, Google developed its own uniques software stack. There were three major software layers to serve as the foundation for the Google platform.

• Google File System (GFS). GFS is a distributed cluster file system that allows all of the disks to be accessed as one massive, distributed, redundant file system.
• MapReduce. A distributed processing framework that allows run parallel algorithms across very large number of servers. Architecture assumes nodes' unreliability, and dealing with huge datasets. MapReduce is a framework for performing parallel processing on large datasets across multiple computers (nodes). In the MapReduce framework, the map operation has a master node which breaks up an operation into subparts and distributes each operation to another node for processing. The reduce is the process where the master node collects the results from the other nodes and combines them into the answer to the original problem.

In some sense MapReduce resembles divide and conquer approach in algorithmic which is a design paradigm based on multi-branched recursion. A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.

• BigTable. A database system that uses the GFS for storage. It means that this was highly distributed layer. From previous material we guess that it couldn't based on relational model.

In this architecture data which are going to be computed are divided into large blocks and distributes across nodes. Then packaged code is transferred into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

Our point of interest in this tutorial is the last component: the database layer, so we will skip all the further information related to the previous two components.

The Google stack as a directly available software wasn't generally exposed to the public. Fortunately details of it were published in a series of papers ([A1], [A2], [A3]). Based on the ideas described in these document, and having positively verified working examples (Google itself), an open source Apache project being an equivalent of Google stack, has been started. This project is known under the Hadoop name.

In 2004, Doug Cutting and Mike Cafarella were working on an open-source project to build a web search engine called Nutch, which would be on top of Apache Lucene. Lucene was a Java-based text indexing and searching library. The Nutch team tend to use Lucene to build a scalable web search engine with the explicit intent of creating an open-source equivalent of the proprietary technologies used by Google.

Initial versions of Nutch could not demonstrate the required scalability allowing to index the entire web. A solution comes from the GFS and MapReduce thanks to Google's published papers (2003, 2004). Having an access to a proven architectural conception, the Nutch team implemented their own equivalents of Google's stack. Because MapReduce was tuned not only to web processing, it soon became clear that this paradigm can be used to a huge number of different massively data processing tasks. The resulting project, named Hadoop, was made available for download in 2007, and became a top-level Apache project in 2008.

In some sense Hadoop has become the de facto standard for massive unstructured data storage and processing. If you don't agree, say why giants such as IBM, Microsoft or Oracle offers Hadoop in their product set.

We can say that no any other project before has had as great influence on Big Data as Hadoop. The main reasons for this were (and still are)

• Accessibility Hadoop was free - everyone can download, install and work with it. Everyone can do this because we don't need superserver to host this software stack - we can install it even on a low budget laptop.
• Good level of abstraction GFS abstracts the storage contained in the nodes. MapReduce abstracts the processing power contained within these nodes. BigTable allows easy storage of almost everything. One week of self training is enough to start using this platform.
• Perfect scalability We can start very small company almost without any costs. Over time, as we become more and more recognizable, we can offer "more" without needs to architectural changes - we simply add another nodes to get more power. This allows many start-ups to offer good solutions based on a proven architectural conception.

Apache Hadoop's main components: MapReduce -- a processing part, and Hadoop Distributed File System (HDFS) -- a storage part were inspired by Google papers on their MapReduce and Google File System. Hadoop’s architecture roughly parallels that of Google. It consists of computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common situation and should be automatically handled by the framework.

The base Apache Hadoop framework is composed of the following modules:

• Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines.
• Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

Nowadays the term Hadoop refers not just to the base components, but rather to the whole ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop:

• Apache Pig Apache Pig is a high-level platform for creating programs that runs on Apache Hadoop. The language used for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce or Apache Spark. With Pig Latin we have one more layer in Hadoop stack abstracting the Java based MapReduce approach into a notation which allows high level programming with MapReduce.

Let's look in a basic example of Pig usage: word counting snippet

• Apache Hive Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. With Hive we have one more layer in Hadoop stack abstracting the low-level Java API for making queries into an SQL-like query (HiveQL) interface. With this, since most data warehousing applications work with SQL-based querying languages, an SQL-based applications can be ported to Hadoop with less problems.

• Apache HBase HBase is an open source, non relational (column-oriented key-value), distributed database modeled based on Google's Bigtable. It runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed no only through the Java API but also through REST, Avro or Thrift gateway APIs.
• Apache Phoenix Apache Phoenix is an open source, massively parallel, relational database engine supporting online transaction processing (OLTP) for Hadoop using Apache HBase.
• Apache Spark Apache Spark is an open source cluster-computing framework. It requires a cluster manager and a distributed storage system. For cluster management, Spark supports, among others, Hadoop YARN. For distributed storage, Spark can interface with a wide variety, including Hadoop Distributed File System (HDFS).
• Apache ZooKeeper Apache ZooKeeper (Chubby in Google's BigTable) is a software project of the Apache Software Foundation. It is essentially a distributed hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
• Apache Flume Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
• Apache Sqoop Apache Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
• Apache Storm Apache Storm is an open source distributed stream processing computation framework.

Although the Hadoop framework itself is mostly written in the Java programming language, and this might be a most natural choice when some new software is created, there are many other languages wich can be used. The most popular are: Python and Scala.

HBase

As mentioned earlier, Google published three key papers revealing the architecture of their platform. The GFS and MapReduce papers served as the basis for the core Hadoop architecture. The third paper describing BigTable served as the basis for one of Hadoop components and the first NoSQL database systems: HBase.

HBase uses Hadoop HDFS as a file system in the same way that most traditional relational databases use the operating system file system. By using Hadoop HDFS as its file system HBase gets for free the following features.

• HBase is able to create tables containing very huge amounts of data. Much bigger than is able to be stored in relational databases.(?)
• HBase need not store multiple copies of data to protect against data loss thanks to the fault tolerance of HDFS.

Although HBase is a schema free database (like all NoSQL databases) it does enforce some kind of structure on the data. This structure is very basic and is based on well known terms: columns, rows, tables and keys. However, HBase tables vary significantly from the relational tables with which we are familiar.

To get a sense how the HBase data model works we describe two concepts:

• aggregation oriented model,
• sparse data.

Aggregation related model

Data aggregation is the compiling of information from databases with intent to prepare combined datasets for data processing.

In database management an aggregate function is a function where the values of multiple rows are grouped together to form a single value of more significant meaning or measurement such as a set, a bag or a list.

Common aggregate functions include: average, count, maximum, sum.

To explain this concept we will recall our Star Wars based example with orders products from the Space Universe store.

In a relational model, each table field must contain basic (atomic) information (then the table is in the first normal form). This excludes, for example, the nesting of tuples inside other tuples. Even tuples as such should be eliminated. This is a bad model for clustering. There is a natural unit for replication and sharing: aggregates.

Note that, aggregated data-oriented databases are also application oriented, as programmers typically handle aggregation-oriented data. In most cases we don't want, for example, atomic data about order, client or invoice to combine them in one object in our application. Better option is to get object we need directly from our database system. In some sense this is what ORM's do: maps set of objects into one object suitable for direct processing in our application. Aggregated database models are: key-value databases, document databases, column families stores.
The key-value database can store any data, they are "transparent" to the database (the database does not "see" their structure). It can only impose limitations on data size. A document databases are more restrictive: there are some limitation concerning structures we can use as well as data types.
In a key-value database, access to aggregation can only be obtained through the aggregation key. You cannot perform operations on aggregation parts here. A document databases are more flexible. We can perform database operations based on the aggregation part.

If we want to present our Star Wars based example in an aggregation-oriented version, we must first determine the aggregation boundaries. These limits need to be determined taking into account what data will most often be collected together. If we use the data most often only for order processing, we can put everything in one "orders" aggregation. In this case this structure written in JSON format looks like this:

If the most commonly use case is sales of an items analysis, it might be better to create two aggregates "customers" and "orders".

Sparse data

What’s typical about aggregation-oriented data representation is that if we look them in a tabular form, we would see that only a few cells contains data. This case is similar to sparse matrix which is a grid of values where only a small percent of cells contain values.

If we will show our orders aggregation from previous section in tabular form it will be like this:

As we can see in this trivial example in Invoice details section we have 4 x 12 = 48 cells but only 14 of them have some values. The more items we will offer the more columns we will have but the same time less cells will have some values because people will buy just a few things from a whole assortment. This is how we get a sparse data.

Column family (BigTable) stores

Column family systems are important NoSQL data architecture patterns because they can scale to manage large volumes of data.

A general idea behind column families data model is to support frequent access to data used together. Column families for a single row may or may not be near each other when stored on disk, but columns within a one column family are kept together. This is a fundamental element supporting (or supported) by aggregations. It's true, that HBase is a schema free database, so we can store what we want, when we want (we don't have to specify values for all columns as well as we don't have to specify types of our data) and how we want (no need to create tables and columns ahead) but one thing should be decided before we start our database: column families.

The BigTable database, in terms of structure, is completely different from relational databases. It is a fixed (persistent) multidimensional ordered map. Column family stores use row and column identifiers as general purposes keys for data lookup in this map. More precisely map is indexed by four keys/names: row keys, column families, column keys, and time stamps (the latter defines different versions of the same data). All values in this map appear as unmanaged character tables (their interpretation, such as data type, is an application task).

In an example below we will shows an example of such a map saved in JSON format. We can see that BigTable organizes data as maps nested in other maps. Within each map, the data is physically sorted by the keys of the map.

The basic unit of storing information is the row identified by a key. A row key usage allows access to the associated columns (also identified by theirs keys). The row key is a string of arbitrary characters from a typical length of 10 to 100 bytes (the maximum length is 64KiB). Saving and reading rows is an "atomic" (indivisible) operation regardless of the number of columns we read or write to. BigTable stores rows structured lexicographically according to theirs keys.

Rows in the table are dynamically partitioned (divided into smaller subsets). Each of these subset (row ranges), called a tablet, determines a distribution unit in a cluster. Wise use of the tablet(s) helps to balance the database load. Readings from small row ranges are more efficient, because they typically require communication between fewer machines, in the best case we use data from one machine.

Columns. The rows consists of columns. Each row may have a different number of columns storing different data types (including compound types).

The column keys are grouped into sets called column families. They are the basic unit when accessing data. Data stored in a given family is usually of the same type. Column families must be created before starting data placement. Once created, you cannot use a column key that does not belong to any family. The number of different column families in the table should not exceed 100 and should be rarely changed at work (their change requires a database stop and restart). The number of columns is not limited. They can be added and removed while the database is running.

There may be different versions of the same cell in the BigTable database identified by timestamp. For this reason, it is necessary to clearly label these versions. Therefore every cell contains a time stamp. It can be assigned by the BigTable database engine (in this case, the real time expressed in microseconds), or by client applications. Different versions of cells are stored in the descending order of the date. This is because the latest versions can be read first. There are two options to not store to many versions and automatically remove older data: keep the last $n$ versions, or for example versions from some time period (eg the last week).

From the above set of information we can infer that, the way to access a cell in a BigTable table is four-dimensional. To access a particular table cell, we enter the following four keys: row key, column family name, column name, time stamp.

We may use one set of four keys (row key, column family name, column name, time stamp) to specify one specific cell version.

Using a set of three keys (row key, column family name, column name) we specify all versions of a given cell.

Using a set of two keys (row key, column family name), we specify all cells in a given column family.

Using a row key we specify all columns from all column families.

Working example, 1: HBase basics

Technically HBase is written in Java, works on top of Hadoop, and relies on ZooKeeper. A HBase cluster can be set up in either local or distributed mode. Distributed mode can further be either pseudo distributed or fully distributed mode.

HBase needs the Java Runtime Environment (JRE) to be installed and available on the system so this would be our first step. Oracle’s Java is the recommended package for use in production systems.

This section describes the setup of a single node standalone HBase. To run HBase in the standalone mode, we don’t need to do a lot. A standalone instance has all HBase components -- the Master, RegionServers, and ZooKeeper -- running in a single JVM saving data to the local filesystem.

Java installation

Perform the following steps for installing Java

1. We are going to use HBase in version 2.0 and this, as for now, supports Java 8.

2. We are required to set the JAVA_HOME environment variable before starting HBase. We can set the variable via an operating system’s usual mechanism as we do it now or use HBase mechanism and edit conf/hbase-env.sh file as we do in further section.

HBase installation

1. Download the HBase binary. Choose a download site from this list of Apache Download Mirrors. Click on the suggested top link. This will take us to a mirror of HBase Releases. Click on the folder named stable and then download the binary file (.tar.gz). In our case this is hbase-1.2.6-bin.tar.gz (about 100MiB) file dated on 2017-06-20 12:42.

and change your working directory to the just-created
3. As we have stated in a previous section, we are required to set the JAVA_HOME environment variable before starting HBase. We can do this either via an operating system’s usual mechanism or use HBase mechanism and edit conf/hbase-env.sh file as we do now. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for our operating system -- it should be set to a directory which contains the executable file bin/java. Because most Linux operating systems provide a mechanism for transparently switching between versions of executables such as Java (see /usr/bin/alternatives lines in previous section) we can set JAVA_HOME to the directory containing the symbolic link to bin/java, which is usually /usr.

In our case lines

were replaced by

4. Edit conf/hbase-site.xml, which is the main HBase configuration file. Working in the standalone mode we only need to specify the directory on the local filesystem where HBase and ZooKeeper write data. By default, a new directory is created under /tmp which may not be what we want because this is temporary directory and its contents is... temporary and may be deleted. The following configuration will store HBase’s data in the HBase directory, in the home directory of the user called nosql.

5. Almost done. The HBase installation provides a convenient way to start it via the bin/start-hbase.sh After issuing the command, if all goes well, a message is logged to standard output showing that HBase started successfully. We should verify this with the jps command: there should be only one running process called HMaster. In standalone mode HBase runs all daemons (HMaster, HRegionServer, ZooKeeper) within this single JVM.

HBase also comes with a preinstalled web-based management console that can be accessed using http://localhost:60010. By default, it is deployed on HBase's Master host at port 60010. This UI provides information about various components such as region servers, tables, running tasks, logs, and so on.

HBase basic usage

1. Connect to HBase
2. Get HBase status
3. Create a table
6. Put data into our table