Skip to content

Column databases

In this part we cover the following topics


Why it's worth to talk about column databases?

When I started working on this material, I thought to skip column databases. In some sense, we can say that this is more about physical data organization than philosophy or general, implementation independent, ideas related with data storage. Then it came the time to prepare graph stores material and it was the first time I have realized that sometimes thing which differs is not database itself (understood as a store -- a place we store a given type of data in a predefined way) but rather the way we work with it. For example a pure graph is just a mindless static data structure which can be implemented in almost any existing database we know. Thing that counts is the ability to query it efficiently.


I have frequently mentioned to my students a Ludwig Wittgenstein's sentence: The limits of my language mean the limits of my world. This, in computer science world, states that the way we write our software, implement algorithms etc. is determined by programming languages we know. The same way cultures we know, determine and limit how we think about world which sorrounds us. The left to right, top to bottom organization of European written languages, and in consequence digits and mathematical formulas, has programmed our minds to visualize and percept data in row format. Even working with file, which by its nature is a stream of bytes -- data written in one long line -- we tend to think about it as organized into rows. But no matter how convenient and familiar this format may be, it is not always the best way to organize data physically.

Have in mind that almost all kids are totally free of such limits. They know nothing about conventions and rules, so writes texts as they only can, the way they think may be useful or possible. Consider my son's text written in Polish: miejsce na prezent (in English: a place for a gift)

In Polish: miejsce na prezent, in English: a place for a gift

Paradoxically we can say that the more columns we have the less probable is that we will process all the columns of a single row. Rather, very often we want to process the values of a single column across all rows. This requirement is addressed by column-oriented databases storing all data belonging to the same column physically together on disk.

It's also worth to note that the way we use database has chaged over time. Formerly, in relational era the most time-critical operations were CRUD operations very often performed on single record(s). Any reporting jobs analyzing entire tables were run from time to time in a background batch mode where query response time was not a critical issue. Now, we live in a data driven world. Everywhere we can find data and evertyhing can be a source of data. Increasing computing power allows us to proccss more data than ever before. Particularly we us a lot of different types of analytic and decision support software which often demand ability to work with the same feature (or small subset of all features) on all records than with a single record itself. For example, based on :::need an example???


The columnar way of thinking

In the row-oriented model, all columns for each row are co-located in the same disk blocks, while in a column-oriented, values for a specific column become co-located. This is the essence of the columnar idea: data for columns is grouped together on disk.

There are two big advantages to the columnar architecture.

  • The main benefit of keeping data in a column database is that some queries can become really fast. In a columnar architecture, queries that seek to aggregate the values of specific columns are optimized, because all of the values to be aggregated exist within the same disk blocks. Thanks to this, we can just read the data needed instead of searching up the values for each record row by row. During querying, columnar storage avoids going over non-relevant data.
  • The second key advantage for the columnar architecture is the ability to get higher compression rate. As we know, compression algorithms work primarily by removing redundancy within data values. Data that is highly repetitious -- especially if those repetitions are localized -- achieve higher compression ratios than data with low repetition. Since column data is of uniform type, the data type for each column is similar, we can get better compression when running compression algorithms on each column.

Because in real life nothing is for free a column-oriented architecture has also some penalty.
The key disadvantage of the column-oriented model is the overhead it imposes on single row operations. Operations that retrieve all the data for a given object (the entire row) are slower compared to row-oriented model. Another key weakness of a columnar architecture is an insert and update overhead for single rows.

In a columnar database, retrieving a single row involves assembling the row from each of the column stores for that table. The more fields we need to read per record, the less benefits we get from storing in a column-oriented fashion. In fact, if our queries are for looking up object-specific values only, row-oriented databases usually perform those queries much faster. A row-based system can retrieve the row in a single disk read, whereas numerous disk operations to collect data from multiple columns are required from a columnar database. However, in the majority of cases, whole-row operations are generally rare, and only a limited subset of data is retrieved.


Working example, 1: Druid


Working example, 2: Vertica