In this part we cover the following topics:
- Everything changes: the new trend of the 21st century
- Do we really need time series databases?
- Working example 1: TimescaleDB
- Working example 2: DalmatinerDB
- Working example 3: Riak TS
Yet the fundamental conditions of computing have changed dramatically over the last decade. Everything has become compartmentalized. Monolithic mainframes have vanished, replaced by serverless functions, microservices, and containers. Today, anything that can be made simpler and work as an independent component is a component.
In some sense, we are going back to our roots. I hope we all know the KISS (Keep it simple, stupid) Unix principle. At its heart, this principle states that most systems work best if they are kept simple rather than made complicated; therefore simplicity should be a key goal in design and unnecessary complexity should be avoided. The KISS acronym was reportedly coined by Kelly Johnson, lead engineer at the Lockheed Skunk Works (creators of the Lockheed U-2 and SR-71 Blackbird spy planes, among many others). The idea was to design aircraft so simple that an averagely gifted mechanic could repair them in the field with only basic tools. Later this rule was transferred and adapted to many other fields of science, engineering, and social life.
Similar minimalist concepts:
- Occam's razor: among competing hypotheses, the one with the fewest assumptions should be selected.
- Leonardo da Vinci's "Simplicity is the ultimate sophistication."
- Mies van der Rohe's "Less is more."
- Bjarne Stroustrup's "Make simple tasks simple."
- Antoine de Saint-Exupéry's "It seems that perfection is reached not when there is nothing left to add, but when there is nothing left to take away."
- Attributed to Albert Einstein: "Make everything as simple as possible, but not simpler." [W11]
On the other hand, almost every real-world object is full of sensors. Mankind is crazy about measuring almost every aspect of existence, whether of living beings or artificial objects. Cars, phones, human bodies, and even clothes or social groups now have, or will soon have, sensors, or are otherwise monitored, and have become sources of time-related information. So now everything is emitting an endless stream of metrics and events, i.e. time series data. What is more, we do not stop at collecting data: we use it in applications built around measuring how things change over time, where time isn't just a metric but a primary axis. This means that the underlying platforms need to evolve to support these new workloads.
Having this in mind, we can define time series data, and hence time series databases, as:
- a sequence of data points, measuring the same thing over time, stored in time order or
- a series of numeric values, each paired with a timestamp, and identified by a set of tags or better
- data that collectively represent how an object (system / process / behavior) changes over time.
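The definitions above can be sketched in a few lines of code. This is a minimal illustration, not any particular TSDB's data model; all names here (`Point`, `cpu.usage`, `host`) are made up for the example.

```python
from dataclasses import dataclass

# A time series point: a numeric value paired with a timestamp and
# identified by a set of tags (second definition above).
@dataclass(frozen=True)
class Point:
    metric: str       # what is being measured
    timestamp: float  # seconds since the Unix epoch
    value: float      # the numeric measurement
    tags: frozenset   # identifying (key, value) pairs

# A time series is then a sequence of such points, measuring the same
# thing over time, stored in time order (first definition above).
series = sorted(
    [
        Point("cpu.usage", 1_700_000_010.0, 42.5, frozenset({("host", "web-1")})),
        Point("cpu.usage", 1_700_000_000.0, 40.1, frozenset({("host", "web-1")})),
    ],
    key=lambda p: p.timestamp,
)
print([p.value for p in series])  # values in time order: [40.1, 42.5]
```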
In short: yes, we do. Let the shortest argument be a software development usage trend: at the beginning of 2018, time series databases (TSDBs) had emerged as the fastest-growing segment of the database industry over the preceding 24 months. [W12]
For some reason the use of such databases has increased significantly. Instead of using well-known relational databases or any other existing NoSQL store, developers have turned to alternative solutions. The reason, as with graph or column stores, is not that we can't store time data in an existing storage system. Again, what differs is not the database itself (understood as a store -- a place where we keep a given type of data in a predefined way) but rather the way we work with it.
A Time Series Database (TSDB) is a database optimized for time-stamped or time series data. Time series are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time. This could be anything: server metrics, application performance monitoring, network data, sensor data, events, clicks, trades in a market, and many other types of analytics data. The key difference between time series data and regular data is that we always query it over time. A time series database is built specifically for handling metrics, events, or measurements that are time-stamped, and is optimized for measuring their changes over time. We can distinguish some properties that make time series data very different from other data workloads.
- Data lifecycle management. With time series databases, it's common to keep high-precision data only for a short period of time. This data is aggregated and downsampled into longer-term trend data. This means that every data point that goes into the database will eventually have to be deleted once its retention period is up. This kind of data lifecycle management is difficult for application developers to implement on top of other databases.
- Append-only writes. Data that arrives is almost always recorded as a new entry. Time series workloads are generally append-only: we mostly use INSERTs, not UPDATEs. While applications may need to correct erroneous data after the fact, or handle delayed or out-of-order data, these are exceptions, not the norm. Recording each and every change to the system as a new row allows us to measure change: analyze how something changed in the past, monitor how something is changing in the present, and predict how it may change in the future.
- Data summarization. With a time series database, it is common to request a summary of data over a large time period. This requires going over a huge range of data points to perform some computation.
- Time series data piles up very quickly.
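The workload properties above can be illustrated with a toy in-memory series. This is a sketch, not a real TSDB: `TinySeries` and its methods are invented for the example, and show append-only writes (with the occasional out-of-order arrival), summarization over a time window, and retention-based deletion.

```python
import bisect
from collections import namedtuple

Sample = namedtuple("Sample", "ts value")

class TinySeries:
    """Toy time series store: append-only, summarizable, with retention."""

    def __init__(self):
        self._samples = []  # kept sorted by timestamp

    def insert(self, ts, value):
        # New data is recorded as a new entry, never an UPDATE;
        # bisect handles the occasional out-of-order arrival.
        bisect.insort(self._samples, Sample(ts, value))

    def mean(self, start, end):
        # Summarization: aggregate all points in the window [start, end).
        window = [s.value for s in self._samples if start <= s.ts < end]
        return sum(window) / len(window) if window else None

    def enforce_retention(self, now, keep_seconds):
        # Lifecycle management: drop points older than the retention period.
        cutoff = now - keep_seconds
        self._samples = [s for s in self._samples if s.ts >= cutoff]

s = TinySeries()
for ts, v in [(0, 10.0), (60, 20.0), (30, 30.0)]:  # note out-of-order ts=30
    s.insert(ts, v)
print(s.mean(0, 61))           # 20.0 -- average over the window
s.enforce_retention(now=120, keep_seconds=60)
print(len(s._samples))         # 1 -- only the point with ts >= 60 survives
```

A real TSDB does the same things, but on disk, in bulk, and automatically; the point is that bolting retention and windowed aggregation onto a general-purpose database pushes all of this logic into the application.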
For performance reasons, the on-disk data in time series databases is organized in a columnar-style format, where contiguous blocks of time are stored per measurement, tagset, and field. Each field is thus laid out sequentially on disk for blocks of time, which makes calculating aggregates on a single field a very fast operation. [W14]
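A rough sketch of why this layout helps (not any specific TSDB's actual format): in a columnar arrangement, each field's values for a block of time sit in one contiguous array, so a single-field aggregate scans that compact array instead of skipping through whole interleaved rows.

```python
import array

# Row-oriented: every row interleaves all fields (time, cpu, mem).
rows = [(ts, 0.5 + ts * 0.001, 256.0) for ts in range(1000)]

# Column-oriented: one contiguous block per field for this time range.
timestamps = array.array("d", (r[0] for r in rows))
cpu        = array.array("d", (r[1] for r in rows))
mem        = array.array("d", (r[2] for r in rows))

# Aggregating a single field only touches that field's block --
# no need to load or step over the other fields at all.
cpu_max = max(cpu)
print(round(cpu_max, 3))  # 1.499
```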
Storage efficiency is extremely important. Purpose-built TSDBs consume only a few bytes of storage per data point. Why this is important becomes clear when we start to add up the costs.
Let's say we have one server, and we want to record a single metric for its overall CPU utilisation percentage over a year at a 1-second interval. That's 31,536,000 points that need to be stored. With DalmatinerDB, for example, that takes up 31 MiB on disk, while at the other end of the scale it would use 693 MiB with Elasticsearch.
Now let's say we have to store 3 million metrics per second. With DalmatinerDB that's 93 TiB of disk space needed for a year's worth of data. With Elasticsearch it's just over 2 PiB. On Amazon S3 (Amazon Simple Storage Service) that would be the difference between 3k USD per month and 63k USD per month. [W13]
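The arithmetic behind these figures is easy to check. The per-point byte costs below are derived from the quoted 31 MiB and 693 MiB per-metric-year numbers; the scaled total comes out slightly below the quoted 93 TiB because the 31 MiB figure is itself rounded.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 points at a 1 s interval
MIB = 1024 ** 2
TIB = 1024 ** 4

points_one_metric = SECONDS_PER_YEAR
print(points_one_metric)                       # 31536000

# Implied per-point storage cost for one metric-year.
dalmatiner_bytes_per_point = 31 * MIB / points_one_metric
elastic_bytes_per_point = 693 * MIB / points_one_metric
print(round(dalmatiner_bytes_per_point, 1))    # ~1.0 byte per point
print(round(elastic_bytes_per_point))          # ~23 bytes per point

# Scale to 3 million metrics recorded every second for a year.
metrics = 3_000_000
print(round(metrics * 31 * MIB / TIB))         # ~89 TiB for DalmatinerDB
```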