In this part we cover the following topics: the origins of Hadoop, the Hadoop ecosystem, the BigTable/HBase data model, and a hands-on introduction to HBase.
We live in a data-driven world, and there is no sign that this will change in the near future. We are witnessing something that can be called a data revolution. Just as the industrial revolution changed the way we think about production, from goods created essentially by hand, in house, to goods created on assembly lines in big factories, so our data sources have also changed. Not long ago, all data was generated locally, "in house". Now data comes in from everywhere: our actions (visiting web pages, online shops, playing games, etc.), social networks, various sensors, and so on. One could even say that data has become a perpetuum mobile: every dataset generates twice as much new data, which in turn generates twice as much new data, and so on.
More data allows us to use more efficient and sophisticated algorithms. Better algorithms mean that we can work better, faster and more safely, and our general satisfaction with technology increases. The more satisfied people are with the technology they use, the more new users it attracts. More users generate more data. More data allows us to...
There are people who say: there is nothing more practical than a good theory. I don't like this statement and prefer to say: there is nothing more theoretical than a good practice. Hadoop is a good example of this: practical needs led to a new computational paradigm and to one of the best known types of NoSQL database: the column family store.
The story began when PageRank was developed at Google. Without getting into details, which interested readers can find in books or on the internet, one can say that PageRank allowed the relevance of a page to be weighted based on the number of links pointing to that page, instead of simply indexing keywords within web pages. This makes the implicit assumption that the more popular a page is, the more links to it can be found across the whole network. This seems reasonable: if I don't like someone's page, I don't link to it, and conversely, I link to the pages I like. PageRank is a great example of a data-driven algorithm that leverages collective intelligence (also known as the wisdom of the crowd) and adapts as more data becomes available. This brings us close to another trendy and popular term, machine learning, which is closely related to big data, which in turn is related to NoSQL.
Given the predicted amount of data to be processed, PageRank required non-standard solutions.
- Google decided to massively parallelize and distribute processing across very large numbers of low-budget servers.
- Instead of dedicated storage servers built and operated by other companies, storage was based on disks directly attached to the same servers that perform the data processing.
- The unit of computation was redefined. No more individual servers; instead, Google used the Modular Data Center. A module comprises a shipping container housing about a thousand custom-designed Intel-based servers running Linux, together with an independent power supply and air conditioning. Data center capacity is therefore increased not by adding servers individually but by adding new modules of roughly 1,000 servers each.
We have to emphasize the most significant assumption made by Google: data collection and processing are based on a highly parallel and distributed technology. There was only one problem: such a technology did not exist at the time.
To solve this problem, Google developed its own unique software stack, with three major software layers serving as the foundation of the Google platform.
- Google File System (GFS). GFS is a distributed cluster file system that allows all of the disks to be accessed as one massive, distributed, redundant file system.
- MapReduce. A distributed processing framework that allows parallel algorithms to be run across a very large number of servers; the architecture assumes that nodes are unreliable and that datasets are huge. In the MapReduce framework, a master node breaks an operation into subparts and distributes each part to a worker node for processing (the map phase). In the reduce phase, the master collects the results from the worker nodes and combines them into the answer to the original problem.
In some sense MapReduce resembles the divide and conquer approach in algorithmics, a design paradigm based on multi-branched recursion. A divide and conquer algorithm works by recursively breaking a problem down into two or more sub-problems of the same or related type, until these become simple enough to be solved directly; the solutions to the sub-problems are then combined to give a solution to the original problem.
- BigTable. A database system that uses GFS for storage, which means it is a highly distributed layer. From the previous material we can guess that it could not be based on the relational model.
In this architecture, the data to be processed is divided into large blocks and distributed across nodes; packaged code is then transferred to the nodes so that they can process the data in parallel. This approach takes advantage of data locality: nodes manipulate the data they have local access to. It allows the dataset to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are connected via high-speed networking.
Our point of interest in this tutorial is the last component, the database layer, so we will skip further details of the first two components.
Hadoop: Open-Source Google Stack
The Google stack was never released as directly available software. Fortunately, its details were published in a series of papers ([A1], [A2], [A3]). Based on the ideas described in these documents, and with a positively verified working example (Google itself), an open-source Apache project equivalent to the Google stack was started. This project is known under the name Hadoop.
Hadoop's origins
In 2004, Doug Cutting and Mike Cafarella were working on an open-source web search engine called Nutch, built on top of Apache Lucene, a Java-based text indexing and searching library. The Nutch team intended to use Lucene to build a scalable web search engine, with the explicit goal of creating an open-source equivalent of the proprietary technologies used by Google.
Initial versions of Nutch could not achieve the scalability required to index the entire web. A solution came from GFS and MapReduce, thanks to the papers Google published in 2003 and 2004. Having access to a proven architectural design, the Nutch team implemented their own equivalents of Google's stack. Because MapReduce was not tuned exclusively to web processing, it soon became clear that the paradigm could be applied to a huge number of other massive data processing tasks. The resulting project, named Hadoop, was made available for download in 2007 and became a top-level Apache project in 2008.
In some sense Hadoop has become the de facto standard for massive unstructured data storage and processing. If you don't agree, ask yourself why giants such as IBM, Microsoft and Oracle offer Hadoop in their product sets.
Arguably no other project before has had as great an influence on Big Data as Hadoop. The main reasons for this were (and still are):
- Accessibility. Hadoop is free: everyone can download, install and work with it. And everyone really can, because no super-server is needed to host this software stack; it can be installed even on a low-budget laptop.
- A good level of abstraction. GFS (HDFS in Hadoop) abstracts the storage contained in the nodes, MapReduce abstracts the processing power contained within those nodes, and BigTable (HBase in Hadoop) allows easy storage of almost anything. A week of self-training is enough to start using this platform.
- Scalability. We can start a very small company at almost no cost. Over time, as we become more and more recognizable, we can offer "more" without architectural changes: we simply add more nodes to get more power. This has allowed many start-ups to offer good solutions based on a proven architectural design.
Apache Hadoop's main components, MapReduce (the processing part) and the Hadoop Distributed File System, HDFS (the storage part), were inspired by Google's papers on MapReduce and the Google File System. Hadoop's architecture roughly parallels Google's: it consists of computer clusters built from commodity hardware, and all Hadoop modules are designed with the fundamental assumption that hardware failures are common and should be handled automatically by the framework.
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines.
- Hadoop YARN – a platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
- Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
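To make the MapReduce module less abstract, below is a minimal sketch of the classic word-count job written directly against the Hadoop MapReduce Java API (essentially the canonical WordCount example from the Hadoop documentation; input and output paths are assumed to be passed as command-line arguments). The same task reappears later in this part written in Pig Latin and HiveQL.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split every input line into words and emit a (word, 1) pair per word.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper emits a (word, 1) pair for every token and the reducer sums the counts for each word; the framework takes care of distributing both phases across the cluster and of moving intermediate data between them.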
Nowadays the term Hadoop refers not just to the base components, but rather to the whole ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop:
- Apache Pig Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language used on this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce or Apache Spark. With Pig Latin we get one more layer in the Hadoop stack, abstracting the Java-based MapReduce approach into a notation that allows high-level programming on top of MapReduce.
Let's look at a basic example of Pig usage: a word-counting snippet.
input_lines = LOAD 'input_data.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/results.txt';

- Apache Hive Apache Hive is a data warehouse software project built on top of Apache Hadoop, providing data summarization, query, and analysis. With Hive we get one more layer in the Hadoop stack, abstracting the low-level Java API into an SQL-like query interface (HiveQL). Since most data warehousing applications work with SQL-based query languages, this allows SQL-based applications to be ported to Hadoop with fewer problems. The same word count expressed in HiveQL looks like this:
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_data.txt' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
- Apache HBase HBase is an open-source, non-relational (column-oriented key-value), distributed database modeled after Google's Bigtable. It runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed not only through the Java API but also through REST, Avro or Thrift gateway APIs.
- Apache Phoenix Apache Phoenix is an open source, massively parallel, relational database engine supporting online transaction processing (OLTP) for Hadoop using Apache HBase.
- Apache Spark Apache Spark is an open source cluster-computing framework. It requires a cluster manager and a distributed storage system. For cluster management, Spark supports, among others, Hadoop YARN. For distributed storage, Spark can interface with a wide variety of systems, including the Hadoop Distributed File System (HDFS).
- Apache ZooKeeper Apache ZooKeeper (the counterpart of Chubby in Google's BigTable stack) is a software project of the Apache Software Foundation. It is essentially a distributed hierarchical key-value store, which is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
- Apache Flume Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
- Apache Sqoop Apache Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
- Apache Storm Apache Storm is an open source distributed stream processing computation framework.
Although the Hadoop framework itself is mostly written in the Java programming language, and Java might be the most natural choice when creating new software for it, many other languages can be used. The most popular are Python and Scala.
As mentioned earlier, Google published three key papers revealing the architecture of its platform. The GFS and MapReduce papers served as the basis for the core Hadoop architecture. The third paper, describing BigTable, served as the basis for one of Hadoop's components and one of the first NoSQL database systems: HBase.
HBase uses HDFS as its file system in much the same way that traditional relational databases use the operating system's file system. By building on HDFS, HBase gets the following features for free.
- HBase can create tables holding huge amounts of data, typically far more than would be practical to store in a single relational database.
- HBase does not need to store multiple copies of its data to protect against data loss, because it inherits the fault tolerance of HDFS.
Although HBase is a schema-free database (like most NoSQL databases), it does enforce some structure on the data. This structure is very basic and is built from well-known terms: columns, rows, tables and keys. However, HBase tables differ significantly from the relational tables we are familiar with.
To get a sense of how the HBase data model works, we describe two concepts:
- aggregation oriented model,
- sparse data.
Data aggregation is the compiling of information from databases with the intent of preparing combined datasets for processing.

In database management, an aggregate function is a function in which the values of multiple rows are grouped together to form a single value of more significant meaning, or a collection such as a set, a bag or a list. Common aggregate functions include average, count, maximum and sum. Note, however, that in the NoSQL context the word "aggregate" is also used in a second sense: an aggregate is a unit of related data that is stored and retrieved together, and it is this sense that matters in what follows.
To explain this concept we will recall our Star Wars based example, with orders of products from the Space Universe store.
Table: invoice

Invoice   Customer
number    number
==================
1         10
2         30
3         20
4         30

Table: customer_details

Customer  Customer        Customer
number    name            location
==================================
10        Dart Vader      Star Destroyer
20        Luke Skywalker  Naboo
30        C3PO            Tatooine

Table: invoice_details

Invoice  Invoice  Item         Item      Item
number   item     name         quantity  price
================================================
1        1        lightsaber   1         100
1        2        black cloak  2         50
1        3        air filter   10        2
2        1        battery      1         25
3        1        lightsaber   5         75
3        2        belt         1         5
4        1        wires        1         10
In a relational model, each table field must contain basic (atomic) information (the table is then in first normal form). This excludes, for example, nesting tuples inside other tuples; such nested structures must be eliminated. The relational model is therefore a poor fit for clustering, which needs a natural unit of replication and sharding: the aggregate.
Note that aggregate-oriented databases are also application-oriented, because programmers typically handle data in aggregate form. In most cases we do not want to fetch atomic data about an order, a client or an invoice only to combine them into one object in our application; a better option is to get the object we need directly from the database. In some sense this is what ORMs do: they map a set of rows into one object suitable for direct processing in our application. Aggregate-oriented database models are: key-value databases, document databases and column family stores.
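As a rough illustration (the class names below are hypothetical and not taken from any particular framework), the "orders" aggregate built in the next paragraphs could correspond to a single object graph in application code:

import java.util.List;

// Hypothetical classes sketching one "orders" aggregate:
// everything needed to process a single invoice travels together.
class Customer {
    String name;
    String location;
}

class InvoiceLine {
    String itemName;
    int quantity;
    double price;
}

class Order {                      // the aggregate root
    int invoiceNumber;
    Customer customer;             // nested data, not a foreign key to another table
    List<InvoiceLine> lines;       // nested data, not rows in a separate table
}

An aggregate-oriented database stores and retrieves such an Order as one unit, which is exactly the shape the application wants to work with.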
A key-value database can store any data; the data is "transparent" to the database (the database does not "see" its structure), and the database can only impose limitations on data size. Document databases are more restrictive: there are some limitations concerning the structures and data types we can use.
In a key-value database, an aggregate can only be accessed through its key; you cannot perform operations on parts of the aggregate. Document databases are more flexible: we can perform database operations based on parts of the aggregate.
If we want to present our Star Wars example in an aggregate-oriented version, we must first determine the aggregate boundaries. These boundaries should be chosen according to which data will most often be accessed together. If we use the data mostly for order processing, we can put everything into one "orders" aggregate. Written in JSON, this structure looks like this:
// orders aggregation
{
  "Invoice number" : 1,
  "Invoice details" : [
    {"Item name" : "lightsaber",  "Item quantity" : 1,  "Item price" : 100},
    {"Item name" : "black cloak", "Item quantity" : 2,  "Item price" : 50},
    {"Item name" : "air filter",  "Item quantity" : 10, "Item price" : 2}
  ],
  "Customer details" : {
    "Customer name" : "Dart Vader",
    "Customer location" : "Star Destroyer"
  }
},
{
  "Invoice number" : 2,
  "Invoice details" : [
    {"Item name" : "battery", "Item quantity" : 1, "Item price" : 25}
  ],
  "Customer details" : {
    "Customer name" : "C3PO",
    "Customer location" : "Tatooine"
  }
},
{
  "Invoice number" : 3,
  "Invoice details" : [
    {"Item name" : "lightsaber", "Item quantity" : 5, "Item price" : 75},
    {"Item name" : "belt",       "Item quantity" : 1, "Item price" : 5}
  ],
  "Customer details" : {
    "Customer name" : "Luke Skywalker",
    "Customer location" : "Naboo"
  }
},
{
  "Invoice number" : 4,
  "Invoice details" : [
    {"Item name" : "wires", "Item quantity" : 1, "Item price" : 10}
  ],
  "Customer details" : {
    "Customer name" : "C3PO",
    "Customer location" : "Tatooine"
  }
}
If the most common use case is analysis of item sales, it might be better to create two aggregates, "customers" and "orders".
// customers aggregation
{
  "Customer number" : 10,
  "Customer name" : "Dart Vader",
  "Customer location" : "Star Destroyer"
},
{
  "Customer number" : 20,
  "Customer name" : "Luke Skywalker",
  "Customer location" : "Naboo"
},
{
  "Customer number" : 30,
  "Customer name" : "C3PO",
  "Customer location" : "Tatooine"
}

// orders aggregation
{
  "Customer number" : 10,
  "Invoice number" : 1,
  "Invoice details" : [
    {"Item name" : "lightsaber",  "Item quantity" : 1,  "Item price" : 100},
    {"Item name" : "black cloak", "Item quantity" : 2,  "Item price" : 50},
    {"Item name" : "air filter",  "Item quantity" : 10, "Item price" : 2}
  ]
},
{
  "Customer number" : 30,
  "Invoice number" : 2,
  "Invoice details" : [
    {"Item name" : "battery", "Item quantity" : 1, "Item price" : 25}
  ]
},
{
  "Customer number" : 20,
  "Invoice number" : 3,
  "Invoice details" : [
    {"Item name" : "lightsaber", "Item quantity" : 5, "Item price" : 75},
    {"Item name" : "belt",       "Item quantity" : 1, "Item price" : 5}
  ]
},
{
  "Customer number" : 30,
  "Invoice number" : 4,
  "Invoice details" : [
    {"Item name" : "wires", "Item quantity" : 1, "Item price" : 10}
  ]
}
What is typical about an aggregate-oriented data representation is that if we looked at it in tabular form, we would see that only a few cells contain data. This is similar to a sparse matrix: a grid of values in which only a small percentage of cells contain values.
If we show the orders aggregate from the previous section in tabular form, it looks like this:
       |  Customer details     |                                             Invoice details                                                  |
-------+------------+----------+------------+------------+-------------+-------------+------------+------------+------+-------+---------+---------+-------+-------+
Invoice| Customer   | Customer | lightsaber | lightsaber | black cloak | black cloak | air filter | air filter | belt | belt  | battery | battery | wires | wires |
number | name       | location |            | price      |             | price       |            | price      |      | price |         | price   |       | price |
-------+------------+----------+------------+------------+-------------+-------------+------------+------------+------+-------+---------+---------+-------+-------+
1      | Dart Vader | Star     | 1          | 100        | 2           | 50          | 10         | 2          |      |       |         |         |       |       |
       |            | Destroyer|            |            |             |             |            |            |      |       |         |         |       |       |
-------+------------+----------+------------+------------+-------------+-------------+------------+------------+------+-------+---------+---------+-------+-------+
2      | C3PO       | Tatooine |            |            |             |             |            |            |      |       | 1       | 25      |       |       |
-------+------------+----------+------------+------------+-------------+-------------+------------+------------+------+-------+---------+---------+-------+-------+
3      | Luke       | Naboo    | 5          | 75         |             |             |            |            | 1    | 5     |         |         |       |       |
       | Skywalker  |          |            |            |             |             |            |            |      |       |         |         |       |       |
-------+------------+----------+------------+------------+-------------+-------------+------------+------------+------+-------+---------+---------+-------+-------+
4      | C3PO       | Tatooine |            |            |             |             |            |            |      |       |         |         | 1     | 10    |
-------+------------+----------+------------+------------+-------------+-------------+------------+------------+------+-------+---------+---------+-------+-------+
As we can see, even in this trivial example the Invoice details section has 4 x 12 = 48 cells, but only 14 of them hold a value. The more items we offer, the more columns we get, and at the same time fewer and fewer cells hold values, because customers buy only a few things from the whole assortment. This is how we end up with sparse data.
Column family systems are important NoSQL data architecture patterns because they can scale to manage large volumes of data.
The general idea behind the column family data model is to support frequent access to data that is used together. Column families for a single row may or may not be near each other when stored on disk, but columns within one column family are kept together. This is a fundamental element supporting aggregates. It is true that HBase is a schema-free database, so we can store what we want, when we want (we do not have to provide values for all columns, nor specify the types of our data) and how we want (there is no need to declare individual columns ahead of time), but one thing must be decided before we start using a table: its column families.
In terms of structure, the BigTable database is completely different from relational databases. It is a persistent, multidimensional, sorted map. Column family stores use row and column identifiers as general-purpose keys for data lookup in this map. More precisely, the map is indexed by four keys: row key, column family, column key, and timestamp (the latter distinguishes different versions of the same data). All values in this map are treated as uninterpreted arrays of bytes; their interpretation, such as the data type, is the application's task.
The example below shows such a map saved in JSON format. We can see that BigTable organizes data as maps nested within other maps. Within each map, the data is physically sorted by the map's keys.
{
  "fulmanp" : {                          <-- row key
    "info" : {                           <-- column family
      "email" : {                        <-- column name
        t1: fulmanp@math.uni.lodz.pl
      },
      "name" : {
        t1: Piotr
      },
      "phone" : {
        t1: 123456                       <-- versions (indexed by timestamps)
        t4: 456789
      },
      "surname" : {
        t1: Fulmański
      }
    }
  },
  "radmat" : {
    "info" : {
      "email" : {
        t2: radmat@math.uni.lodz.pl
      },
      "name" : {
        t2: Radosław
      },
      "phone" : {
        t2: 234567
      },
      "surname" : {
        t2: Matusik
      }
    }
  },
  "wojczyk" : {
    "info" : {
      "email" : {
        t3: wojczyk@math.uni.lodz.pl
      },
      "name" : {
        t3: Sebastian
      },
      "phone" : {
        t3: 345678
      },
      "surname" : {
        t3: Wojczyk
      }
    }
  }
}
The basic unit of information storage is the row, identified by its key. The row key gives access to the associated columns (also identified by their keys). A row key is a string of arbitrary bytes, typically 10 to 100 bytes long (the maximum length is 64 KiB). Reading or writing a row is an atomic (indivisible) operation, regardless of the number of columns read or written. BigTable stores rows sorted lexicographically by their keys; note that with purely lexicographic ordering the key row-10 sorts before row-2, which is why row keys are often padded or otherwise designed with the sort order in mind.
Rows in a table are dynamically partitioned (divided into smaller subsets). Each such subset of row ranges, called a tablet, is the unit of distribution in a cluster. Wise use of tablets helps to balance the database load: reads of small row ranges are efficient because they typically require communication with fewer machines, in the best case only one.
Columns. Rows consist of columns. Each row may have a different number of columns storing different data types (including compound types).
Column keys are grouped into sets called column families, which are the basic unit of data access. Data stored in a given family is usually of the same type. Column families must be created before data can be placed in them; you cannot use a column key that does not belong to any family. The number of different column families in a table should not exceed about 100 and should rarely be changed once the system is in use. The number of columns, on the other hand, is not limited; columns can be added and removed while the database is running.
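In the HBase Java client this "column families first" rule shows up directly in the administrative API: a table is created from a descriptor that lists its families (and per-family settings such as the number of kept versions, discussed in the next paragraph), while individual columns are never declared. A minimal sketch, assuming the HBase 1.x client API (these classes are deprecated in 2.x in favour of builders) and a hypothetical table name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("people"));
            HColumnDescriptor info = new HColumnDescriptor("info"); // the only schema decision: the family
            info.setMaxVersions(5);                                 // keep up to 5 versions of each cell
            table.addFamily(info);
            admin.createTable(table);                               // no individual columns are declared here
        }
    }
}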
A cell in a BigTable database may exist in several versions, identified by timestamps, so every cell carries a timestamp. It can be assigned by the BigTable engine (in which case it is the real time expressed in microseconds) or by client applications. Different versions of a cell are stored in descending timestamp order, so that the latest versions are read first. To avoid storing too many versions, older data can be removed automatically according to one of two policies: keep only the last n versions, or keep only the versions from some time period (e.g. the last week).
From the above we can see that access to a cell in a BigTable table is four-dimensional. To reach a particular cell we provide four keys: row key, column family name, column name and timestamp.
row      |------------------------- Column family: info ----------------------------------|
key      | Column 1      Column 2       Column 3                          Column 4        |
           key           key            key                               key
           |             |              |                                 |
           name          surname        email                             phone
           |             |              |                                 |
fulmanp    t1: Piotr     t1: Fulmański  t1: fulmanp@math.uni.lodz.pl      t1: 123456  <-- versions
                                                                          t4: 456789      indexed by
radmat     t2: Radosław  t2: Matusik    t2: radmat@math.uni.lodz.pl       t2: 234567      timestamp
wojczyk    t3: Sebastian t3: Wojczyk    t3: wojczyk@math.uni.lodz.pl      t3: 345678
We may use one set of four keys (row key, column family name, column name, time stamp) to specify one specific cell version.
compound key                    value (version)
     |                                |
[fulmanp, info, phone, t1]  ->  123456
Using a set of three keys (row key, column family name, column name) we specify all versions of a given cell.
[fulmanp, info, phone]  ->  t1: 123456
                            t4: 456789
Using a set of two keys (row key, column family name), we specify all cells in a given column family.
[fulmanp, info]  ->  {
                       "email" : {
                         t1: fulmanp@math.uni.lodz.pl
                       },
                       "name" : {
                         t1: Piotr
                       },
                       "phone" : {
                         t1: 123456
                         t4: 456789
                       },
                       "surname" : {
                         t1: Fulmański
                       }
                     }
Using a row key we specify all columns from all column families.
[fulmanp]  ->  {
                 "info" : {
                   "email" : {
                     t1: fulmanp@math.uni.lodz.pl
                   },
                   "name" : {
                     t1: Piotr
                   },
                   "phone" : {
                     t1: 123456
                     t4: 456789
                   },
                   "surname" : {
                     t1: Fulmański
                   }
                 }
               }
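The same four-dimensional addressing is visible in the HBase client API. The sketch below, assuming the HBase 1.x Java client and the hypothetical "people" table used earlier, writes one cell and then reads back all stored versions of it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FourKeyAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("people"))) {

            // Write: row key + column family + column qualifier (the timestamp is assigned by the server).
            Put put = new Put(Bytes.toBytes("fulmanp"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("phone"), Bytes.toBytes("456789"));
            table.put(put);

            // Read: [row key, family, qualifier] plus a version limit returns {timestamp: value} pairs.
            Get get = new Get(Bytes.toBytes("fulmanp"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("phone"));
            get.setMaxVersions(5);                                   // readVersions(5) in the 2.x API
            Result result = table.get(get);
            result.getColumnCells(Bytes.toBytes("info"), Bytes.toBytes("phone")).forEach(cell ->
                System.out.println(cell.getTimestamp() + ": " + Bytes.toString(CellUtil.cloneValue(cell))));
        }
    }
}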
Technically, HBase is written in Java, works on top of Hadoop, and relies on ZooKeeper. An HBase cluster can be set up in either standalone (local) or distributed mode; distributed mode can in turn be either pseudo-distributed or fully distributed.
HBase needs a Java Runtime Environment (JRE) installed and available on the system, so this is our first step. Oracle's Java is the recommended package for use in production systems.
This section describes the setup of a single-node standalone HBase. To run HBase in standalone mode we do not need to do much: a standalone instance has all HBase components (the Master, RegionServers, and ZooKeeper) running in a single JVM, saving data to the local filesystem.
Perform the following steps to install Java and HBase:
- The HBase release we are going to use supports Java 8, so we install Oracle Java 8 first.
nosql@nosql-virtual-machine:~$ sudo add-apt-repository ppa:webupd8team/java
nosql@nosql-virtual-machine:~$ sudo apt update
nosql@nosql-virtual-machine:~$ sudo apt install oracle-java8-installer
[During the install process, you have to accept the Java license to continue downloading and installing the Java binaries.]
nosql@nosql-virtual-machine:~$ sudo apt install oracle-java8-set-default
nosql@nosql-virtual-machine:~$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

- We are required to set the JAVA_HOME environment variable before starting HBase. We can set the variable via the operating system's usual mechanism, as we do now, or use the HBase mechanism and edit the conf/hbase-env.sh file, as we do in a later step.
nosql@nosql-virtual-machine:~$ which java
/usr/bin/java
nosql@nosql-virtual-machine:~$ ls -l /usr/bin/java
lrwxrwxrwx 1 root root 22 lis  8 15:06 /usr/bin/java -> /etc/alternatives/java
nosql@nosql-virtual-machine:~$ ls -l /etc/alternatives/java
lrwxrwxrwx 1 root root 39 lis  8 15:06 /etc/alternatives/java -> /usr/lib/jvm/java-8-oracle/jre/bin/java
nosql@nosql-virtual-machine:~$ ls -l /usr/lib/jvm/java-8-oracle/
total 25952
drwxr-xr-x 2 root root     4096 lis  8 15:06 bin
-r--r--r-- 1 root root     3244 lis  8 15:06 COPYRIGHT
drwxr-xr-x 4 root root     4096 lis  8 15:06 db
drwxr-xr-x 3 root root     4096 lis  8 15:06 include
-rwxr-xr-x 1 root root  5200672 lis  8 15:06 javafx-src.zip
drwxr-xr-x 5 root root     4096 lis  8 15:06 jre
drwxr-xr-x 5 root root     4096 lis  8 15:06 lib
-r--r--r-- 1 root root       40 lis  8 15:06 LICENSE
drwxr-xr-x 4 root root     4096 lis  8 15:06 man
-r--r--r-- 1 root root      159 lis  8 15:06 README.html
-rw-r--r-- 1 root root      526 lis  8 15:06 release
-rw-r--r-- 1 root root 21115360 lis  8 15:06 src.zip
-rwxr-xr-x 1 root root    63933 lis  8 15:06 THIRDPARTYLICENSEREADME-JAVAFX.txt
-r--r--r-- 1 root root   145180 lis  8 15:06 THIRDPARTYLICENSEREADME.txt
root@nosql-virtual-machine:~# echo "export JAVA_HOME=/usr" >> /etc/profile
root@nosql-virtual-machine:~# echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle
- Download the HBase binary. Choose a download site from the list of Apache Download Mirrors and click on the suggested top link. This takes us to a mirror of the HBase releases. Click on the folder named stable and download the binary file (.tar.gz). In our case this is the file hbase-1.2.6-bin.tar.gz (about 100 MiB), dated 2017-06-20 12:42.
- Extract the downloaded file,
nosql@nosql-virtual-machine:~/Desktop/downloads$ tar -zxvf hbase-1.2.6-bin.tar.gz
and change your working directory to the just-created hbase-1.2.6 directory:
nosql@nosql-virtual-machine:~/Desktop/downloads$ cd hbase-1.2.6/
nosql@nosql-virtual-machine:~/Desktop/downloads/hbase-1.2.6$

- As stated in the previous section, we are required to set the JAVA_HOME environment variable before starting HBase. We can do this either via the operating system's usual mechanism or via the HBase mechanism, editing the conf/hbase-env.sh file, as we do now. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for your operating system: it should point to a directory which contains the executable file bin/java. Because most Linux operating systems provide a mechanism for transparently switching between versions of executables such as Java (see the /etc/alternatives lines in the previous section), we can set JAVA_HOME to the directory containing the symbolic link to bin/java, which is usually /usr.
In our case the lines

# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/

were replaced by

# The java implementation to use. Java 1.7+ required.
export JAVA_HOME=/usr

- Edit conf/hbase-site.xml, which is the main HBase configuration file. In standalone mode we only need to specify the directory on the local filesystem where HBase and ZooKeeper write their data. By default a new directory is created under /tmp, which may not be what we want, because /tmp is a temporary directory and its contents are... temporary and may be deleted. The following configuration stores HBase's data in the hbase directory, in the home directory of the user called nosql.
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/nosql/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/nosql/zookeeper</value>
  </property>
</configuration>
- Almost done. The HBase installation provides a convenient way to start it via the bin/start-hbase.sh script. After issuing the command, if all goes well, a message is logged to standard output showing that HBase started successfully. We can verify this with the jps command: there should be just one running process called HMaster. In standalone mode HBase runs all daemons (HMaster, HRegionServer, ZooKeeper) within this single JVM.
HBase also comes with a web-based management console that can be accessed at http://localhost:60010. By default it is deployed on the HBase Master host at port 60010. This UI provides information about various components such as region servers, tables, running tasks, logs, and so on.
nosql@nosql-virtual-machine:~/Desktop/downloads/hbase-1.2.6$ bin/start-hbase.sh
starting master, logging to /home/nosql/Desktop/downloads/hbase-1.2.6/bin/../logs/hbase-nosql-master-nosql-virtual-machine.out
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
nosql@nosql-virtual-machine:~/Desktop/downloads/hbase-1.2.6$ jps
46132 HMaster
46458 Jps
- Connect to HBase
nosql@nosql-virtual-machine:~/Desktop/downloads/hbase-1.2.6/bin$ ./hbase shell
2017-11-08 23:58:11,056 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.2.6, rUnknown, Mon May 29 02:25:32 CDT 2017
hbase(main):001:0>

- Get HBase status
hbase(main):001:0> status
1 active master, 0 backup masters, 1 servers, 1 dead, 2.0000 average load

- Create a table
hbase(main):003:0> create 'tab1', 'cf1'
0 row(s) in 1.3710 seconds

=> Hbase::Table - tab1
hbase(main):004:0> create 'tab2', 'cf1', 'cf2'
0 row(s) in 1.2480 seconds

=> Hbase::Table - tab2

- List information about tables
hbase(main):020:0> list
TABLE
tab1
tab2
2 row(s) in 0.0320 seconds

=> ["tab1", "tab2"]

- Describe the metadata information about tables
hbase(main):001:0> describe 'tab1'
Table tab1 is ENABLED
tab1
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.5440 seconds

hbase(main):002:0> describe 'tab2'
Table tab2 is ENABLED
tab2
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'cf2', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0310 seconds

- Put data into our table
hbase(main):003:0> put 'tab1', 'row-001', 'cf1:a', 'First value'
0 row(s) in 0.1280 seconds

hbase(main):004:0> put 'tab1', 'row-001', 'cf1:b', 123
0 row(s) in 0.0300 seconds

hbase(main):005:0> put 'tab1', 'row-001', 'cf1:c', 123.456
0 row(s) in 0.0080 seconds

hbase(main):006:0> put 'tab1', 'row-001', 'cf1:d', 456
0 row(s) in 0.0210 seconds

hbase(main):007:0> put 'tab1', 'row-001', 'cf1:d', 'Replacement'
0 row(s) in 0.0150 seconds

- Read data by row
hbase(main):009:0> get 'tab1', 'row-001'
COLUMN                CELL
 cf1:a                timestamp=1510182483121, value=First value
 cf1:b                timestamp=1510182504781, value=123
 cf1:c                timestamp=1510182511625, value=123.456
 cf1:d                timestamp=1510182537839, value=Replacement
4 row(s) in 0.0500 seconds

hbase(main):010:0> get 'tab1', 'row-001', {COLUMN => 'cf1:d', VERSIONS => 3}
COLUMN                CELL
 cf1:d                timestamp=1510182537839, value=Replacement
1 row(s) in 0.0220 seconds

hbase(main):012:0> create 'tab3', {NAME=>"cf1", VERSIONS=>5}
0 row(s) in 1.3510 seconds

=> Hbase::Table - tab3
hbase(main):013:0> describe 'tab3'
Table tab3 is ENABLED
tab3
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0480 seconds

hbase(main):014:0> put 'tab3', 'row-001', 'cf1:a', 123
0 row(s) in 0.0320 seconds

hbase(main):015:0> put 'tab3', 'row-001', 'cf1:a', 123.456
0 row(s) in 0.0240 seconds

hbase(main):016:0> get 'tab3', 'row-001', {COLUMN => 'cf1:a', VERSIONS => 3}
COLUMN                CELL
 cf1:a                timestamp=1510183532925, value=123.456
 cf1:a                timestamp=1510183529338, value=123
2 row(s) in 0.0330 seconds

- Read all data
hbase(main):001:0> scan 'tab1'
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510182483121, value=First value
 row-001              column=cf1:b, timestamp=1510182504781, value=123
 row-001              column=cf1:c, timestamp=1510182511625, value=123.456
 row-001              column=cf1:d, timestamp=1510182537839, value=Replacement
1 row(s) in 0.5110 seconds

hbase(main):002:0> scan 'tab3'
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510183532925, value=123.456
1 row(s) in 0.0410 seconds

hbase(main):003:0> scan 'tab3', VERSIONS=>3
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510183532925, value=123.456
 row-001              column=cf1:a, timestamp=1510183529338, value=123
1 row(s) in 0.0440 seconds
- Delete data from one cell
hbase(main):004:0> delete 'tab1', 'row-001', 'cf1:d'
0 row(s) in 0.1030 seconds

hbase(main):005:0> scan 'tab1'
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510182483121, value=First value
 row-001              column=cf1:b, timestamp=1510182504781, value=123
 row-001              column=cf1:c, timestamp=1510182511625, value=123.456
1 row(s) in 0.0400 seconds

hbase(main):012:0> delete 'tab3', 'row-001', 'cf1:a'
0 row(s) in 0.0050 seconds

hbase(main):013:0> scan 'tab3', VERSIONS=>3
ROW                   COLUMN+CELL
0 row(s) in 0.0100 seconds

hbase(main):001:0> put 'tab3', 'row-001', 'cf1:a', 'abc'
0 row(s) in 0.5460 seconds

hbase(main):002:0> put 'tab3', 'row-001', 'cf1:a', 'def'
0 row(s) in 0.0040 seconds

hbase(main):003:0> scan 'tab3', VERSIONS=>3
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510184160905, value=def
 row-001              column=cf1:a, timestamp=1510184156304, value=abc
1 row(s) in 0.0730 seconds

hbase(main):005:0> delete 'tab3', 'row-001', 'cf1:a', 1510184160905
0 row(s) in 0.0240 seconds

hbase(main):006:0> scan 'tab3', VERSIONS=>3
ROW                   COLUMN+CELL
0 row(s) in 0.0190 seconds

hbase(main):007:0> put 'tab3', 'row-001', 'cf1:a', 'abc'
0 row(s) in 0.0270 seconds

hbase(main):008:0> put 'tab3', 'row-001', 'cf1:a', 'def'
0 row(s) in 0.0100 seconds

hbase(main):009:0> scan 'tab3', VERSIONS=>3
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510184447024, value=def
 row-001              column=cf1:a, timestamp=1510184441361, value=abc
1 row(s) in 0.0320 seconds

hbase(main):010:0> delete 'tab3', 'row-001', 'cf1:a', 1510184441361
0 row(s) in 0.0080 seconds

hbase(main):011:0> scan 'tab3', VERSIONS=>3
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510184447024, value=def
1 row(s) in 0.0150 seconds
- Delete data from one row
hbase(main):012:0> scan 'tab1'
ROW                   COLUMN+CELL
 row-001              column=cf1:a, timestamp=1510182483121, value=First value
 row-001              column=cf1:b, timestamp=1510182504781, value=123
 row-001              column=cf1:c, timestamp=1510182511625, value=123.456
1 row(s) in 0.0510 seconds

hbase(main):013:0> deleteall 'tab1', 'row-001'
0 row(s) in 0.0110 seconds

hbase(main):014:0> scan 'tab1'
ROW                   COLUMN+CELL
0 row(s) in 0.0100 seconds

- Make changes to an existing table
hbase(main):015:0> alter 'tab1', NAME=>'cf1', VERSIONS=>5
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 2.1190 seconds

hbase(main):016:0> describe 'tab1'
Table tab1 is ENABLED
tab1
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0500 seconds

hbase(main):017:0> put 'tab1', 'row-001', 'cf1:d', 456
0 row(s) in 0.0190 seconds

hbase(main):018:0> put 'tab1', 'row-001', 'cf1:d', 'Replacement'
0 row(s) in 0.0090 seconds

hbase(main):019:0> scan 'tab1', VERSIONS=>3
ROW                   COLUMN+CELL
 row-001              column=cf1:d, timestamp=1510185095846, value=Replacement
 row-001              column=cf1:d, timestamp=1510185082718, value=456
1 row(s) in 0.0270 seconds

- Disable and enable a table. If we want to delete a table or change its settings, as well as in some other situations, we need to disable the table first, using the disable command. We can re-enable it using the enable command.
hbase(main):021:0> disable 'tab3'
0 row(s) in 2.3520 seconds

hbase(main):023:0> enable 'tab3'
0 row(s) in 1.2750 seconds

- Delete the table
hbase(main):022:0> list
TABLE
tab1
tab2
tab3
3 row(s) in 0.0130 seconds

=> ["tab1", "tab2", "tab3"]
hbase(main):024:0> disable 'tab3'
0 row(s) in 2.2570 seconds

hbase(main):025:0> drop 'tab3'
0 row(s) in 1.2970 seconds

hbase(main):026:0> list
TABLE
tab1
tab2
2 row(s) in 0.0190 seconds

=> ["tab1", "tab2"]
hbase(main):027:0> drop 'tab1'

ERROR: Table tab1 is enabled. Disable it first.

Here is some help for this command:
Drop the named table. Table must first be disabled:
  hbase> drop 't1'
  hbase> drop 'ns1:t1'

- Exit the HBase Shell
hbase(main):028:0> exit
nosql@nosql-virtual-machine:~/Desktop/downloads/hbase-1.2.6/bin$

- Stop HBase
nosql@nosql-virtual-machine:~/Desktop/downloads/hbase-1.2.6/bin$ ./stop-hbase.sh
stopping hbase...................
nosql@nosql-virtual-machine:~/Desktop/downloads/hbase-1.2.6/bin$ jps
46732 Jps
Note that after issuing the command it can take several minutes for the processes to shut down, so use jps to make sure that the HMaster and all other components have really shut down.