
Graph stores

Information presented on this page may be incomplete or outdated. You can find the most up-to-date version by reading my book NoSQL. Theory and examples


What we have and what we are missing so far

What do we have so far? Quite a lot:

  • Relational databases with their normal forms and ACID transaction model.
  • Column family stores with their strong aggregation-oriented model.
  • Key-value stores with their simplicity and speed.
  • Document stores with the flexibility of key-value stores but also the ability to search more complex data structures typical of relational databases.

All of them store data in their own characteristic way. So we have a few methods to store data, but our data are dumb -- they are just data. Anything more than this we have to infer on our own. There are situations when we want our data to speak. Sometimes we care more about the information hidden in the way the data are organized than about the data themselves. For example, the relationships between things (data) may be more important than the things (data) themselves. If this is the case, this is where graph database systems come in.

It could be argued that we don't need a new type of database (store), because we can model graphs in almost any other database type we know so far. In some sense this is true, but what differs is not the database itself (understood as a store -- a place where we keep a given type of data in a predefined way) but rather the way we work with it.

Graph as something much more than a data structure

Graph stores are built around the simple and general-purpose node-relationship-node data structure. The key question is: Do we really need a new database type?

Because in this case we are talking about relationships, we need some kind of join so that we can relate (connect) one object to another. We cannot process each piece of data in total isolation from the others, which is a key feature of key-value and document stores. While it is true that we can save graph data in these databases, they have no graph traversal capabilities: in a pure key-value or document store, graph traversal logic must be implemented entirely in application code. In this respect, both offer less than relational databases.
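A minimal sketch of that application-side traversal, assuming a toy key-value store held in a Python dict (all keys and field names here are invented for illustration):

```python
from collections import deque

# Toy "key-value store": in a real system each lookup would be a
# network call such as store.get(key); here it is just a dict.
store = {
    "alice": {"name": "Alice", "knows": ["bob"]},
    "bob":   {"name": "Bob",   "knows": ["carol"]},
    "carol": {"name": "Carol", "knows": []},
}

def reachable(start):
    """Breadth-first traversal written entirely in application code:
    the store itself only answers single-key lookups."""
    seen, queue = {start}, deque([start])
    while queue:
        key = queue.popleft()
        for neighbor in store[key]["knows"]:  # one extra lookup per edge
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Every hop costs an extra round trip to the store, which is exactly the kind of work a graph database performs natively.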

Why relational databases are bad for graph modeling

Consider the following database with graph-like dependencies:
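The original diagram is not reproduced here; a minimal relational schema with the same shape (all table and column names are assumptions, chosen to be consistent with the query description that follows) could be:

```sql
CREATE TABLE person (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE project (
    id    INTEGER PRIMARY KEY,
    title VARCHAR(100)
);

-- join table: who was a team member of which project
CREATE TABLE team_member (
    person_id  INTEGER REFERENCES person(id),
    project_id INTEGER REFERENCES project(id)
);
```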

Now if we want to find all persons who have ever worked with XYZ in the same project we can use the following SQL query:
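A sketch of such a query (table and column names are assumed, matching the textual description that follows):

```sql
SELECT DISTINCT co.name
FROM person p
JOIN team_member tm1 ON tm1.person_id  = p.id
JOIN project prj     ON prj.id         = tm1.project_id
JOIN team_member tm2 ON tm2.project_id = prj.id
JOIN person co       ON co.id          = tm2.person_id
WHERE p.name = 'XYZ'
  AND co.id <> p.id;
```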

This SQL queries person to find the person XYZ, then team_member and project to find the projects XYZ has worked in, and then team_member and person again to find the other persons who have been on those projects.

As we can see, the task from the section title is not infeasible, but as we traverse further we have to add another three joins each time we increase the search depth. Furthermore, there is no convenient syntax that allows us to search to an arbitrary depth (recursive common table expressions can express such queries in modern SQL, but they are verbose and their performance degrades quickly). So, for instance, if we want to expand the whole graph, we cannot easily do so, because there is no way to know in advance how many levels we need to examine.

What is a graph and how is it used as a data store?

Like relational databases, but unlike many non-relational systems, graph databases are based on a strong theoretical foundation. Graph theory is a long-established branch of mathematics, with many practical applications in computer science and physics, but also in sociology and medicine.

The most basic NoSQL database -- a key-value store -- uses two types of data fields: the key and the value. In the same way, in the graph stores discussed in this chapter we can distinguish three types of data fields: nodes, relationships, and properties.

A graph is a fairly abstract concept for presenting and studying relations between objects; it includes two basic components: vertices and edges.

  • Vertices, also known as nodes, are usually representations of distinct real-world objects. A vertex can represent virtually any entity that has a relation with another entity. Vertices can be people, cars, organizations, telephone numbers, or even more abstract things like ideas. A vertex representing an entity is marked with a unique identifier, analogous to a primary key in a relational database. The visual representation uses a dot or circle with the node name next to or inside it.
  • Edges, also known as relationships or arcs, represent connections between these objects or show how one object influences others. There are two types of edges: directed and undirected. Directed edges have, which should not be surprising, a direction. This is used to indicate how the relationship modeled by the edge should be interpreted. For example, in a family-relations graph the parent relation is one-way and should have a direction, while sibling may stay undirected because it holds in both directions. Edges are typically represented as arcs (lines) or arrows between circles in diagrams.
  • Properties are like the properties known from object-oriented programming. They describe features (additional information) related to a vertex or an edge and are typically represented as a short text next to the object (vertex or edge). A very commonly used property is the weight of an edge. Weights represent some value about the relationship, such as the strength of the influence of one object on the other. In general, weights can represent cost, distance, or any other measure of the relationship between the objects represented by vertices.

Having a set of vertices connected by edges, we can define a path. A path through a graph is a sequence of vertices together with the edges between consecutive vertices.

Todo (id=graph store:graph model 1): Add an example of something modeled as undirected graph
Todo (id=graph store:graph model 2): Add an example of something modeled as directed graph
Todo (id=graph store:graph model 3): Add an example of path

A graph store is defined as a system that contains a set of nodes (vertices) and connections (edges) between them which, when combined, create a data structure known as a graph. In other words, in graph form we can store objects along with their properties (features) and the relationships between them.

As we stated before, a pure graph is just a mindless, static data structure. What counts is the ability to query it. Graph queries are closely related to traversing the nodes of a graph. We can ask things like these:

  • What is the shortest path between two nodes in a graph?
  • What nodes have neighboring nodes that have specific properties?
  • Given any two nodes in a graph, how similar are their neighboring nodes?
  • Find the predecessors of a given directed-graph node that themselves have no predecessors.
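The first of these questions can be made concrete with a plain breadth-first search; this is only an illustration of what such a query computes (the graph and its node names are made up):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list of nodes,
    or None if goal is unreachable (unweighted breadth-first search)."""
    previous = {start: None}          # also serves as the "visited" set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:              # walk the predecessor chain back
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Made-up graph: A -> B, A -> C, B -> C, C -> D
graph = {"A": ["B", "C"], "B": ["C"], "C": ["D"], "D": []}
```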

In graph databases we can perform all common database operations, including inserting, reading, updating, and deleting data. We can do aggregations as well. Graph databases are also well suited to an additional set of operations; specifically, operations that follow paths or detect repeating patterns in the relationships between vertices.

Graph stores -- characteristics

  • Used to analyze interactions and relations A natural place for graph stores are applications that need to analyze interactions and relations between objects.
  • Used where graphs are used Everything we can do by creating and querying a graph, we can also do in a graph store. Graph traversal -- visiting all nodes in a graph in a particular manner -- is one (non-trivial) example. In particular, graph databases are useful for any business problem that has complex relationships between objects, allowing us to quickly analyze complex network structures and find patterns within them. They can also be considered a replacement for rules-based engines (two examples I have worked with are Drools and Jess).
  • Native is better than emulated Graph processing can be performed on databases regardless of their type: the logical model of a graph store can be implemented on almost any database system. However, as stated earlier, performance on non-dedicated systems, such as relational ones, degrades as the depth of the graph traversal increases. Enabling efficient real-time graph processing in practice requires a way to move through the graph structure without any intermediate layer, such as relational indexes. This index-free navigation is referred to as index-free adjacency. In a native graph database utilizing index-free adjacency, each node possesses direct information about the physical location of all adjacent nodes. In consequence, which is not a big surprise, native graph stores outperform alternative implementations for graph traversal operations.
  • No scale out Unlike the other NoSQL patterns discussed earlier, graph stores are difficult to scale out to multiple servers due to the close connectedness of the nodes in a graph. This is the opposite of, for example, document stores, where every document is analyzed independently of the others. Graph data can be replicated on multiple servers to enhance read and simple-query performance, but writes to multiple servers, as well as graph queries that span nodes located on different servers, are complex to implement.
  • No joins Graph databases show explicit relations between entities. In relational databases, connections are not direct: instead, two entities share a common attribute value, known as a (foreign) key (see the example from the Why relational databases are bad for graph modeling section). To find connections in a relational database, we must perform a join, which entails looking up a value from one table in another table. With graph stores we can query faster by avoiding joins: instead of performing joins, we follow edges from vertex to vertex. This is a much simpler and very much faster operation. Moreover, when following connections we have no depth limit, whereas the relational model requires a new set of joins for every next level of connection, and these must be maintained manually.
  • Simplified modeling As with other NoSQL databases, working with graph stores can simplify the modeling process. There is no need to create dummy tables just to model many-to-many relations; instead, they are modeled explicitly using edges. Moreover, we can easily use multiple types of edges, allowing database designers to model multiple types of relations between entities. This can be useful, for example, when modeling a logistics company that must consider different types of transportation (road, rail, sea, and air), each with different properties, such as time to deliver, cost, size and weight requirements, or additional legal regulations. Of course, multiple relations can also be modeled in relational databases, but they are explicit and easy to understand when using a graph store.

As we can see, graph processing can be difficult to combine with other data store models. Processing big datasets, which demands a distributed approach, can also be a source of problems. Both problems can be solved by using graph compute engines. A graph compute engine implements efficient graph processing algorithms and exposes graph APIs, but it does not assume or require that the data be stored in the index-free adjacency graph format.

Resource Description Format

Todo (id=graph store:verify_rdf_sparql): Verify this section focusing on RDF and SPARQL
Talking about graph stores, we have to mention the Resource Description Format (RDF, or in a wider meaning the Resource Description Framework).

The RDF is a web standard developed in 1997 (and adopted as a W3C recommendation in 1999) for the modeling of web resources and the relationships between them. It represents one of the earliest standards for representing and processing graph data. RDF was intended as a means of creating a formal database of resources on the Web together with the dependencies between those resources. It is used to describe (not to display) knowledge contained on the Internet in a manner understandable to computers (easily processed by computer programs). Information in RDF is expressed in triples, conceptually of the form subject -- predicate -- object,

for instance:
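(an informal example; in real RDF each part of a triple is identified by a URI)

```
subject    predicate    object
-------    ---------    ------
Luke       knows        Leia
```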

RDF uses specific names for the general node-relationship-node structure. The source node is the subject, and the destination node is the object. The relationship that connects them together is the predicate. The entire structure is called an assertion.

To uniquely identify nodes, URIs are used. RDF graphs can be stored in a variety of formats, including XML, or even within tables inside a relational database. Consider the following RDF expressed in XML:
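(a sketch only; the sw namespace and the resource name are hypothetical)

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sw="http://example.com/starwars#">
  <!-- rdf:about uniquely identifies the resource being described -->
  <rdf:Description rdf:about="http://example.com/starwars#Luke">
    <sw:weapon>lightsaber</sw:weapon>
  </rdf:Description>
</rdf:RDF>
```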

The URI identifies the resource, the element name is a property, and lightsaber is its value.

We can query an RDF graph with SPARQL (SPARQL Protocol and RDF Query Language), which is a semantic query language for databases. It was made a standard by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium and is recognized as one of the key technologies of the semantic web. In March 2013, SPARQL 1.1 became an official W3C Recommendation.

SPARQL allows us to write queries against what can be descriptively called key-value data or, more formally, data that follow the RDF specification of the W3C. To make this possible, the entire database should be a set of subject-predicate-object triples.

A SPARQL query
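in its general shape something like the following sketch (every name below is a placeholder, not a real vocabulary):

```sparql
PREFIX foo: <http://example.com/resources/>

SELECT ?subject ?object
FROM <http://example.com/dataset>
WHERE {
    ?subject foo:somePredicate ?object .
}
ORDER BY ?subject
```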

comprises, in order:

  • Prefix declarations, for abbreviating URIs.
  • Dataset definition, stating what RDF graph(s) are being queried.
  • A result clause, identifying what information to return from the query.
  • The query pattern, specifying what to query for in the underlying dataset.
  • Query modifiers, slicing, ordering, and otherwise rearranging query results.
    The example below demonstrates a simple query that returns the names and prices of every item in the Star Wars Store's dataset:
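A sketch of such a query (the src prefix and its vocabulary are assumptions; the original listing is not reproduced here):

```sparql
PREFIX src: <http://example.com/starwars-store#>

SELECT ?name ?price
WHERE {
    ?item a src:Item ;
          src:name  ?name ;
          src:price ?price .
}
```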

    This query joins together all of the triples with a matching subject, where the type predicate a (a is a SPARQL keyword which is a shortcut for the common predicate rdf:type, giving the class of a resource) is an item (src:Item), and the item has one or more names (src:name) and prices (src:price).

    Consider another SPARQL query example that models the query What are all the country capitals in Africa? [W10]:
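One possible formulation, using a hypothetical example ontology:

```sparql
PREFIX ex: <http://example.com/exampleOntology#>

SELECT ?capital ?country
WHERE {
    ?x ex:cityname ?capital ;
       ex:isCapitalOf ?y .
    ?y ex:countryname ?country ;
       ex:isInContinent ex:Africa .
}
```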

    Working with a graph store

    Graph store design

    There is no single correct way to model a graph store for all possible problems. As with other NoSQL databases, the way we approach design in the graph store case strongly depends on the kinds of queries or analysis we will perform on the data. Simply speaking, queries drive the design of graph databases. By their nature, graph databases are well suited to all problem domains that are easily described in terms of objects (represented by nodes) and relations between them (represented by links or connections). Of course, because other NoSQL databases as well as relational databases are also well suited to modeling objects, we should pay special attention to the type of queries and analysis we anticipate. Some of them, listed below, are premises for using a graph store.

    • How many cities are on the shortest route from city A to city B?
    • Calculate the cost of every route (separately) between city A and city B.
    • What is the cheapest route between city A and city B?
    • Find all cities we can reach starting from city A.
    • How many route segments between city A and city B have a cost that is less than a given value?
    • How many routes lead to city A?
    • What would happen if we add/delete a city/route segment?

    Working with graph stores, we start with domain-specific queries. These should be translated (mapped) into graph-specific queries involving nodes, edges, and their properties. So, if we decide that a graph store should be used, the next step is to find a correct abstraction to express real-life objects as nodes and the relations between them as edges. If we do this, our domain-specific queries can be reformulated in a graph-specific language:

    • How many intermediate nodes are on the shortest (in terms of node count) path from node A to node B?
    • Calculate the cost of every path (separately) between node A and node B.
    • What is the cheapest path between node A and node B?
    • Find all nodes we can reach starting from node A.
    • How many edges between node A and node B have a cost that is less than a given number?
    • How many nodes are linked to node A?
    • What would happen if we add/delete a node/edge?

    As we can see, some queries are based on paths, while others take into account the global structure of the graph.

    Despite the fact that the graph store systems available today can work with millions of nodes and edges on a single server, we should be careful when using graph algorithms. Some that run in reasonable time on small graphs may take too long to complete when the graph grows larger. Things that work during tests (because of small datasets) may get stuck in a real application (where datasets may be much larger). That is why we should consider the impact on performance when the number of any crucial component, like users, nodes, edges, or properties, grows.

    Querying a graph

    Depending on our needs, there are two main approaches we can use: declarative querying or imperative querying.

    • Declarative querying
      Declarative languages are well suited to problems that require selecting objects (nodes) based on their properties. They are also convenient when we need to apply aggregate operations, such as grouping or summing values from vertex properties. Like any declarative language, they allow us to state what we want to do with our graph data (select, insert, update, or delete) without requiring us to describe exactly how to do it. This is very close to another declarative language we know well: SQL.

      A very basic SQL query
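(reconstructed to match the description that follows; not the author's original listing)

```sql
SELECT * FROM customer_details;
```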

      which we can describe in words as
      Find and return all rows from customer_details table.
      This can be expressed in declarative graph query language -- in this case in Neo4j's Cypher Query Language -- as
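(again reconstructed to match the description that follows)

```cypher
MATCH (c:Customer)
RETURN c
```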

      which we can describe in words as
      Find and return all nodes of Customer type.
      Cypher is inspired by SQL with the concept of pattern matching taken from SPARQL.

      The real power of graph queries is hidden in the ability to express more complex patterns between nodes. These can be [W3]:

      • Query about relationship types, for example -[:KNOWS|:LIKE]->, which is a pattern for all relations of the KNOWS or LIKE type.
      • Represent a relation as a variable for later use, for example -[rel:KNOWS]->, which is a pattern for all relations of the KNOWS type, available in a subsequent query part under the rel variable name.
      • Require that a relation fulfills some additional requirements, for example -[{since:2010}]->, which is a pattern for relations having the property since with the value 2010.
      • Structural information for paths of variable length, for example -[:KNOWS*..4]->, which is a pattern for a variable-length path of between 1 and 4 KNOWS relationships.
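Putting a few of these patterns together might look like this (the Person label, the name and since properties, and the data are assumptions for illustration):

```cypher
MATCH (p:Person {name: 'Ann'})-[rel:KNOWS {since: 2010}]->(friend)-[:KNOWS*..4]->(other)
RETURN DISTINCT other.name
```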

      This is just a short example to give an idea of declarative querying languages. Following this idea, we can also query nodes and their properties, etc. More can be found in [W4].

      Based on the given examples, it is worth noting that Cypher uses ASCII-art syntax to represent patterns, making queries easy to read and recognize as part of our graph data.

    • Imperative querying
      If we want control over how the query is executed, namely the way our query retrieves nodes and edges, we should consider using an imperative query language. Let's see a few examples written in Gremlin [W7].

      • Get all the vertices in the graph (we will use the term vertex instead of node because this is the term used in the official Gremlin documentation).
      • Get the vertex with the unique identifier of 1.
      • Get the value of the name property of the vertex with the unique identifier of 1.
      • Get the edges with the label enemy for the vertex with the unique identifier of 1.
      • Get the names of the creatures that the creature represented by the vertex with the unique identifier of 1 classifies as enemies.
      • Get the names of the creatures that the creature represented by vertex 1 knows who are over the age of 896.
      • Group and count distinct node labels.
      • What are the names of Vader's enemies' enemies?
      • What is the height of the lowest Jedi?
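The queries above could be phrased in Gremlin-Groovy roughly as follows (vertex labels and property names such as enemy, knows, age, name, jedi, and height are assumptions; the original listings are not reproduced here):

```groovy
g.V()                                                    // all the vertices in the graph
g.V(1)                                                   // the vertex with unique identifier 1
g.V(1).values('name')                                    // the value of its name property
g.V(1).outE('enemy')                                     // its outgoing edges labeled enemy
g.V(1).out('enemy').values('name')                       // names of the creatures it classifies as enemies
g.V(1).out('knows').has('age', gt(896)).values('name')   // known creatures over the age of 896
g.V().groupCount().by(label)                             // group and count distinct vertex labels
g.V().has('name', 'vader').out('enemy').out('enemy').values('name')  // Vader's enemies' enemies
g.V().hasLabel('jedi').values('height').min()            // height of the lowest Jedi
```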
      • As we can see, in each case we have precise information about how to get the final result. For example, the query
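(reconstructed from the step-by-step reading that follows)

```groovy
g.V(1).outE('enemy').inV().values('name')
```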

        can be read:

        • g: the selected graph.
        • V(1): from the selected graph, get the vertex with the unique identifier of 1.
        • outE('enemy'): traverse the outgoing enemy edges of the previously selected set of vertices (in this case there is only one).
        • inV(): get the head vertex of the previously selected edges; in other words, get the vertex pointed to by the previously selected edge.
        • values('name'): from the previously selected vertex, get the value of the name property.

        As we can see, a Gremlin traversal tells the traverser how to proceed at each step. This is the key feature of any imperative language: to do something, we have to say how it should be done. For example, the above imperative traversal first places a traverser at the vertex with the unique identifier of 1. That traverser then follows any outgoing enemy edges. Each edge points to some vertex, so next the traverser focuses on that vertex to get the value of its name property. This traversal is imperative in that it tells the traverser to go here, then go there, and do this, in an explicit, procedural manner.

        Gremlin also supports declarative graph pattern matching similar to SPARQL. In consequence, a Gremlin traversal can be written in an imperative (procedural) manner, a declarative (descriptive) manner, or in a hybrid manner containing both imperative and declarative aspects.

        A great thing about Gremlin is that its creators took special care to unify the database and programming worlds.

        Classic database query languages, like SQL, were conceived as being fundamentally different from the programming languages that would ultimately use them in a production setting. For this reason, classical databases require the developer to code both in their native programming language as well as in the database's respective query language. An argument can be made that the difference between query languages and programming languages are not as great as we are taught to believe. Gremlin unifies this divide because traversals can be written in any programming language that supports function composition and nesting (which every major programming language supports). In this way, the user's Gremlin traversals are written along side their application code and benefit from the advantages afforded by the host language and its tooling (e.g., type checking, syntax highlighting, dot completion, etc.). Various Gremlin language variants exist including: Gremlin-Java, Gremlin-Groovy, Gremlin-Python, Gremlin-Scala, etc. [W7]

      Working example, 1: Gremlin basics