Skip to content

Document stores

In this part we cover the following topics


What is a document store -- basic ideas

All we know relational world. On the one hand it is well organized, predictable and safe but on the other very often limits what we can do and how. In complete opposition to it is key-value (and big table) store island emerging from NoSQL universe, where we can put what we want without any constraints but in the price of much worse abilities to scan the data we have. Somewhere between both worlds there can be located another one, which tries to combine their features. Document databases, it is what we will be talking about now, give us the flexibility of key-value stores but also offers ability to manage more complex data structures typical for relational databases. Typically for NoSQL databases, and unlike relational databases, document databases do not require us to define a common structure for all records in the data store. Document databases, however, do have some similar features to relational databases. For example, it is possible to query and filter collections of documents much as we do in a relational table. Of course, the syntax, or structure, of queries is different between SQL and NoSQL databases, but the general idea remains.

One common misunderstanding concerns what means the document term. It's not about electronic documents like word processing, spreadsheet or any other businessman-readable file. These are the types of things many people would probably think of when they see the word document. They have nothing to do with document databases, at least with respect to the NoSQL type of database we are going to talk in this section. Two well known formats: JSON (JavaScript Object Notation) and XML (eXtensible Markup Language) are commonly used to define documents. The input data and the output data structures are in one of the XML or JSON form. Both are used to define database contents but are not directly used to be stored as final values. Other words, inner data representation base on XML or JSON so a document database stores data as structured documents instead of tables.

Possibility to automatically index content is an undeniable advantage of using a document as a data. You may say, that indexes are large. It is true, but thanks to this, everything is searchable. There are use-cases when fast search (not in a sense of retrieve the value but rather scan what value has inside) is more important than space occupied by our data. Having advanced index, all (indexed) information can be quickly located, we can extract subset of documents satisfying certain set of criteria as well as return path lading from the top of our hierarchy (root) to the element we are looking for (leaf).

Both root and leaf terms suggest us that document have something in common with trees. Indeed, we can think of a document store as a tree-like structure. With this approach we can understand the set of document as a document collection organized as a tree. Such a tree of document have a single root element (we think this way although in practice often we have more root elements). Beneath the root element there is a sequence of branches, nodes and values. Each branch has a related path expression that shows us how to navigate from the root of the tree to any given branch, node, or leaf (value). Sometimes branch existence is an information we need, and sometimes a branch must have a given value at its end to be interpreted correctly. In this sense, documents contain both structure information and values (data).

Tree-like documents can have a very complex structure but even then a search API can remain simple and provides an easy way to select a document, subset of a document or extract subsections from a large number of documents without loading each document into memory. This is different from key-value case when values are retrieved quickly but there is no option to search them as fast -- every value must be loaded into memory and next programmatically examined against data we are looking for.


What is a document -- basic ideas

Although the term document was introduced in the previous section, it wasn't explained there. We stated only that document is not related to word processing, spreadsheet, text or graphic file. So we know what is not a document, but we still don't know what is a document. To be concise, we can say that document is a set of ordered key-value pairs. It's so simple like that. Taking a closer look into this definition we can separate three important parts.

  • Key-value pairs This is a basic building block of every document. How key-value works was described in a previous part: Key-value stores. As we may recall, key-value stores, when presented with a key, return the value (which is a BLOB in general) associated with that key. Keys are used to reference particular values. Values can be either basic or structured data types, in particular documents can be embedded within documents. The key-value store values lack a formal structure and aren’t indexed or searchable and the whole "logic" is hidden in the keys. Document stores work in the opposite manner: the key may be a simple number (like most ID's in relational databases) because it is never used or seen. For example in CouchDB we can choose our own ID that should be in the form of a string. Generally, UUID (Universally Unique IDentifier) is preferred to be used, to avoid collisions. On the other side, we can get almost any values' part out of a document store by querying it.
  • Set Because a document is a set, it has one instance of each key-value pair member.
  • Ordered (set) The order of key-value pairs matters in determining the identity of a document. For example the document

    is not the same document as

From what has already been said we can infer that document is rather very general and flexible data structure. They do not require predefined schemas and they readily accommodate variations in the structure of documents. Documents are organized into related sets called collections. Collection should contain entities of similar type even if their structure differ across documents. It is possible to store different types of documents in a single collection, because collections do not impose a consistent structure on documents in the collection. Of course, the less they differ, the better collection is, because we have to pay less attention to process different variants. One of the key parts of modeling document databases is deciding how we will organize our documents into collections.


The question of the schemas existence

Document stores do not require its structures to be formally defined or specified in any form. That is why we call them schema-less as we call any NoSQL database.

On the other hand, all necessary information are (or can be) inferred from documents within collections. So the same time there is no schema and there is a schema, but we should make a distinction between a formal specification of a structure (of documents, keys, and values) and the structure that is implied by the documents in a collection.

When there is no formal definition of structure we can faced with polymorphic schema. It might seem odd that a database can be without a schema (schemaless) while at the same time having many schemas (polymorphic schema). Polymorphic schema means there are many document structures implied in the set of documents that occur in a collection.

For example from the following collection

although there is no formal definition of structure, we can infer three similar definitions with some common parts. We can say, we have one definition with many variants.


Define documents with XML or JSON

As it has been said before, two well known formats: JSON and XML are commonly used to define documents. Both are used to define database contents but are not directly used to be stored as final values. Both have their own history and reasons why they should (not) be used.


Define documents with XML

Thanks to a very rich set of tools to assist with authoring, validation, searching, and transforming XML format was and still is a natural choice when document are concerned.

We will not explain XML ideas and details -- please use rich internet resources describing them. Instead we will mention the most basic set of XML tools, and this way, we will highlight the most important XML features.

  • DOM (Document Object Model) In short: XML documents can be treated as an objects.

    The DOM defines a standard for accessing and manipulating documents. It is a platform and language neutral interface that allows us to dynamically access and update the content, structure, and style of a document. It presents a document as a tree-structure with nodes correspondings to elements. In consequence, as for every tree, also in DOM familly tree terminology is widely used. The terms parent, child, and sibling are used to describe the relationships amongs elements. Parent nodes have children. Children on the same level are called siblings (brothers or sisters).

    • The top node is called the root.
    • Every node, except the root, has exactly one parent node.
    • A node can have any number of children.
    • A leaf is a node with no children.
    • Siblings are nodes with the same parent.

    Going back to our Star Wars example from Relational model: NORMAL FORMS as well as Column family (BigTable) stores: AGGREGATION RELATED MODEL we can consider an XML document called invoice.xml

    Bellow there is a simple code to manipulate elements of this document using DOM.

    Although far from perfection, this code shows how we can manipulate DOM elements to "transform" pure XML text file into something more readable as is showned bellow

    Document resulting DOM elements manipulations
  • XPath In short: every item has its unique location description. In some sense, path is an item ID.
    The XPath (XML Path Language) is a query language for selecting item (nodes) from a XML document. It is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria.

    In the table below we have listed some XPath expressions and their expected results

    XPath Expression Result
    /A Select the root element A.
    /A/B Select all elements B which are children of the root element A.
    /A/B/C Select all elements C which are children of B which in turn are children of the root element A.
    //C Select all elements C.
    //A/B Select all elements B which are children of A.
    /A/B/C/* Select all elements enclosed by elements /A/B/C.
    /*/*/*/D Select all elements A which have 3 ancestors.
    //* Select all elements.
    /A/B[1] Select the first B child of element A.
    /A/B[last()] Select the last B child of element A.
    /A/B[last()-1] Selects the last but one B element that is the child of the A element.
    /A/B[position()<3] Selects the first two B elements that are children of the A element.
    //@id Select all attributes @id.
    //B[@id] Select B elements which have attribute id.
    //B[@*] Select B elements which have any attribute.
    //B[not(@*)] Select A elements without an attribute.
    //B[@id='2'] Select all the A elements which have an id attribute with a value of a.
    //*[count(B)=2] Select elements which have two children B (function count() counts the number of selected elements).
    //*[count(*)=1] Select elements which have 1 children.
    //*[contains(name(),'C')] Select all elements name of which contain letter C.
    /A/B | //D Select all elements D and elements B which are children of root element A.
    /A/B[C>12.34] Selects all the B elements of the A element that B have a C element with a value greater than 12.34.
    /A/B[C>12.34]/C Selects all the C element of the B elements of the A element such that B have a C node with a value greater than 12.34.

    We can test them on a simple file

    in one of many available online XPath tester, for example XPath Tester or XPath Tester.

  • XQuery In short: XML document is searchable. Items can be located not only by ID (XPath) but also by specifying complex queries. XQuery is a language for finding and extracting elements and attributes from XML documents.

    XQuery stands for XML Query Language. It is a language to select subsets and substructures from a set of XML files. The result can be manipulated into something that is suitable to be used in, for example, a subsequent process, or to show in a web browser. XPath is used a lot in XQuery. XQuery is a superset of XPath that also provides FLWOR syntax, which is SQL-like.

    XPath language role

    The programming language XQuery defines FLWOR (pronounced flower) as an expression that supports iteration and binding of variables to intermediate results. FLWOR is an acronym: for, let, where, order by, return. FLWOR is loosely analogous to SQL's SELECT-FROM-WHERE and can be used to provide join-like functionality to XML documents.

    • for creates a sequence of nodes;
    • let binds a sequence to a variable;
    • where filters the nodes on a boolean expression;
    • order by sorts the nodes;
    • return gets evaluated once for every node.

    Nice introduction to FLOWR we can find at Blooming FLWOR - An Introduction to the XQuery FLWOR Expression webpage.

    Here are an examples of what XQuery could solve.

    • The following path expression (xquery_01.xqy):

      will select all the itemname elements under the invoiceitem elements that are under the invoice element such that invoiceitem has a itemprice element with a value that is higher than 30. In our case this should be lightsaber and black cloak.


      fulmanp@fulmanp-UL30VT:~/Pulpit/nosql$ xqilla xquery_01.xqy
      lightsaber
      black cloak

      The following FLWOR expression (xquery_02.xqy) will select exactly the same as the path expression above


      fulmanp@fulmanp-UL30VT:~/Pulpit/nosql$ xqilla xquery_02.xqy
      lightsaber
      black cloak

      FLWOR let us do much more than simply selct elements (what we can do with simple path expression). For examle (xquery_03.xqy), we can sort the result


      fulmanp@fulmanp-UL30VT:~/Pulpit/nosql$ xqilla xquery_03.xqy
      black cloak
      lightsaber
    • With FLWOR we can acomplish much more complex tasks (queries), like the following expressed in plain English

      Get all invoices from invoices.xml document that have less than three item types, order these invoices by increasing average price per item, and return a report including for each invoice customer number, amount spend on all items, number of all items and average price per item (xquery_04.xqy)

      The result could look like:

      Customer number Total invoice price Total invoice quantity Total price per item
      30 25 1 25
      20 380 6 63.33
    • XML schema In short: correctness of a XML document can be verified against its schema.
      The purpose of an XML Schema is to define the legal building blocks of an XML document. Although it is not perfect [Wikipedia:XML_Schema_(W3C):Criticism], it is videly used to specify:

      • the elements and attributes that can appear in a document;
      • the number of (and order of) child elements;
      • data types for elements and attributes;
      • default and fixed values for elements and attributes.

      A schema (invoice.xsd) for a previously introduced invoice.xml document used above may look like it is showned below


      fulmanp@fulmanp-UL30VT:~/Pulpit/nosql$ xmllint --noout --schema invoice.xsd invoice.xml
      invoice.xml validates
    • XSLT (Extensible Stylesheet Language Transformations) In short: XML documents are easily transformable to other formats.
      XSLT, which stands for Extensible Styles Language Transformations, enables us to transform XML documents into another form. For example, we can take an XML document from preceding point (invoice.xml), and with the following invoice_xml_latex.xsl

      transform it into LaTeX source file

      which allows us to print such a nice pdf document

      Other simple examples: Example: Transforming with XSLT stylesheets or XSLT Basic Example.


    Define documents with JSON

    Over the years XML has established its position as a standard format for file-based documents and for data interchange. Features listed in a previous section are undoubtedly classified as an advantages. On the other side it wouldn't be fair to silence about its drawbacks as a storage format. XML is justifiably criticized as being wasteful of space and computationally expensive to process. Its tags are verbose and repetitious which typically increases the amount of required storage and as a result, the XML language format is relatively expensive to parse. Despite its disadvantage, there were no alternative over the horizon, till the early 2000s when new technique called AJAX (Asynchronous JavaScript And XML) emerged.

    AJAX is not a single technology, but rather a group of technologies that can be used to implement a Web application that communicates with a server in the background, without interfering with the current state of the page. Before 2000, in the early-to-mid 1990s, most Web sites were based on complete HTML pages. Each user action required that a complete new page be loaded from the server. Each time the browser reloaded a page because of a partial change, all of the content had to be re-sent, even though only a small subset of the information had changed. This process was inefficient and led to unsatisfactory user experience: all page content disappeared, then the new page appeared. Another drawback was an additional, extra, load on the servers limiting total performance. This technique consists of several elements, of which the most important from our point of view are two:

    • JavaScript JavaScript is used to bring all related technologies together. Although used in the name, it can be de facto any (scripting) language functioning on the user's side.
    • XML The markup language through which the interchanged information should be described.

    XML was good but have aforementioned drawbacks and it soon turned out that could be easily replaced by some native to JavaScript format, namely JSON.

    It's difficult to say that JSON was invented and it's even more difficult to say by whom. Many people rather discovered then invented this data transfer and storage format independently of one another. Even Douglas Crockford, the man who coined and popularized the term JSON, states he discovered JSON rather than invented it. In his own words: A number of people independently discovered that JavaScript’s object literals were an ideal format for transmitting object-oriented data across the network. I made my own discovery in April of 2001 when I was CTO of State Software. [W9]

    The following, well known in this tutorial, Star Wars based example shows a possible JSON representation describing an invoice number 1 details

    Note JSON and XML are similar formats, but generally support dissimilar applications. XML document (and in consequence databases) excel for content management systems, organizing and maintaining textual data. Range of possible application starts from simple config files, through advanced build dependencies (Apache Maven) or UI description (XAML -- Extensible Application Markup Language ) to complicated office documents (ODF -- Open Document Format for Office Applications). JSON document (databases), on the other hand, mostly support modern web-based applications storing and modifying its dynamic content and interchanging data.

    Working with JSON the hierarchy of storage is typically as follows:

    • The basic unit of storage is a document, corresponding, more or less, to a row in a relational model. A document consist of one or more key-value pairs. Important is that documents can be also used as a values. Inside every document, we may have nested other documents or arrays. Arrays may also contain documents. This allows us to create a complex hierarchical structure.
    • A collection (called also (data) bucket) is a set of documents related to a common category of information. Looking for an analogy to the relational model, it would be a table. There are no formal rules to fall a document to a given collection -- it's up to the programmer / designer / modeler. Documents in a collection don’t have to be of the same type or have the same structure, though it is logical and convenient for it to be so.

    Bellow an example of an invoice collection is given.

    To some degree, XML-based document databases represent a previous generation of database system, while JSON databases can claim to be the next generation. In the near future, we can expect that support for JSON may stay a common feature to all databases.


    Document store remarks


    Normal forms

    Theoretically a document database could implement a third normal form schema exactly as it would be used within a relational system. And this is a trap for people with a rich relational background, starting work with NoSQL. They try to mimic relational model with NoSQL tools. Remember: don't do that! Never ever! It was not for NoSQL to model relations in it. Start to think differently!

    Instead of implementing normal forms, document databases more typically model data in a smaller number of collections, with nested documents representing master-detail relationships.


    Model determinism

    Again, as for other NoSQL databases / stores, data modeling in document databases is less deterministic than for relational databases because there is no equivalent of correct model defined in relational model by (third) normal form.

    Remember the fundamental difference affecting the way of modeling in relational and NoSQL model: a relational database is driven primarily by the nature of the data to be stored, while a document store is driven by the nature of the queries to be executed.

    In relational model we don't think about future use of our data. We care only how to store them in a somehow universal manner without breaking some general logical rules. With this model our data can be used by everyone for each task. In some sense, they are equally unsuitable for everyone and require additional processing each time (e.g. using object-relational mappings) in order to use them.

    In document stores we organize our data so it would be easier to use directly what we retrieve from it. In consequence data organization, its model, is determined by the way we are going to use them in our application. In some sense, model depends on application needs.

    Designing document database we depend more on heuristics, our intuition and some general rules of thumb than any formal, like normalization, rules. Having relational database, and looking at their tables we can say in isolation from real application if it would be perform efficient or not. It's not true for document stores, where considering database without telling about an application doesn't make any sense. We must consider how users will use the database:

    • how often it will be queried, inserted and updated,
    • how much querying, inserting and updating will be done,
    • what data (which part) will be queried, inserted and updated.


    Too much abstraction is not advisable

    As we know from the The question of the schemas existence section, we have no formal schemas in document stores and in consequence we have polymorphic schema. Polymorphic schema means there are many document structures implied in the set of documents that occur in a collection. Having this possibility it's very tempting to use more and more abstract documents. If we do this, we should ask ourselves if there is a situation when we write more and more separated code to process different document subtypes. If it's the case, we should better consider separating the variants into different collections. Poor collection design can adversely affect performance and slow our application.


    Schema-less can be tricky

    In the section The question of the schemas existence we have discussed the schema issue. Document databases are well suited to accommodate variations in documents within a collection. Because any new document could be a little bit different from all other document stored so far, it does not make sense to require that all possible document fields should be specified prior to building and populating the database.

    Although it is true that we do not have to define a schema prior to adding documents, there are some practical consideration related with this. Any type of schema or constrains force us to rethink about something. In database context it allows us to look again into our data model and ask ourselves if it would play or not. Living totally without any schema leads to more and more complex structures that change beyond our control. In some sense, structures starts to live its own life. Because we have polymorphism at document level, so documents related to a common category of information may be a little different. Over time, we accept the fact that these differences are more and more significant. At the end, we find out that we're writing an extensive code just to cover completely differently structured documents that belong to the same category.

    So schema-less means freedom from the need to predefine the database structure ahead but shouldn't be understand as a permission to mindless model expansion.


    More flexibility by a price of more responsibility

    Flexibility understandable as freedom from the need to predefine formally the database structure prior to adding documents no stays without any consequences. Because there is no any method to constrain what we do, or want to do with our database, this control responsibility is transferred to the programmer. It may be the case that developers must write huge code to maintain data correctness and to avoid possible anomalies in a documents store. This is our (programmer) responsibility and there is no any support from database side in this aspect.

    On the other hand, taking care about data seems to be the most resources consuming task whether we say about code have to be written, time have to be spent or any other technical aspect. An undeniable profit is that we will have less need for database tuning and query optimization in the future. Overall, in terms of programmer productivity and data accessibility, we should gain more than lose.


    Data denormalization

    When you think it is good, denormalize your data. Do it as often as you need but not more often. Denormalization, like all good things, can be used in excess. Our goal is to keep data that is frequently used together in the document. This allows the document database to minimize the number of times it must read from persistent storage. Too much denormalization leads to huge and impractical documents.



    Working example, 1.1: CouchDB basics



    Installation

    1. To install CouchDB on Linux we can follow Installation on Unix-like systems
      According to 1.1.1.1. Enabling the Apache CouchDB package repository from 1.1.1. Installation using the Apache CouchDB convenience binary packages the following command should be executed first


      $ echo "deb https://apache.bintray.com/couchdb-deb bionic main" \
      | sudo tee -a /etc/apt/sources.list

      According to 1.1.1.2. Installing the Apache CouchDB packages from 1.1.1. Installation using the Apache CouchDB convenience binary packages

      $ curl -L https://couchdb.apache.org/repo/bintray-pubkey.asc \
      | sudo apt-key add -
      [... CUT ...]
      $ sudo apt-get update && sudo apt-get install couchdb
      [... CUT ...]

      During instalation we will be asked about some details (as an admin pasword on the stage Password for the CouchDB "admin" user I entered admin so in my case: user:password: admin:admin).

      Now, according to final sentence in this section: Relax! CouchDB is installed and running.
    2. What we can do next, is not strictly according to step 1.1.6. First Run from manual. After installation, open built-in web interface of CouchDB by visiting the following link: http://127.0.0.1:5984/. If everything goes fine, this will give us a Fauxton, which is the web interface of CouchDB, having the following output.


      or the same from command line (results are presented as reformatted text)

      nosql@couchdb:~$ curl http://127.0.0.1


      Generally speaking we can operate with CouchDB either with web interface or command line. In the following part mostly a command line will be used. Web interface is very intuitive and for those who know command line shouldn't be a great problem to discover on their own and work with it.
    3. Verify installation by visiting Fauxton at http://127.0.0.1:5984/_utils/index.html

      and selecting Verify CouchDB Installation option from left-bar menu


    Create database, add user

    1. List of all databases

      nosql@couchdb:~$ curl -X GET http://127.0.0.1:5984/_all_dbs


      As we can see, there are three databases and one of them (test_suite_db) is a result of self-test CouchDB procedure we have initiated in Installation step. We can delete it with

      nosql@couchdb:~$ curl -X DELETE http://127.0.0.1:5984/test_suite_db



      nosql@couchdb:~$ curl -X GET http://127.0.0.1:5984/_all_dbs

    2. Create a database

      nosql@couchdb:~$ curl -X PUT http://127.0.0.1:5984/test_database_01


      As we can see, at this point we are not a server admin, so we don't have authorization to create a new database. More details can be found at 1.6. Security. Let's fix this...
    3. Use admin account to create a database

      nosql@couchdb:~$ curl -X PUT http://admin:admin@127.0.0.1:5984/test_database_01



      nosql@couchdb:~$ curl -X GET http://127.0.0.1:5984/_all_dbs

    4. Add user (user:password: user01:password01)


      nosql@couchdb:~$ curl -X PUT http://localhost:5984/_users/org.couchdb.user:user01 \
      > -H "Accept: application/json" \
      > -H "Content-Type: application/json" \
      > -d '{"name": "user01", "password": "password01", "roles": [], "type": "user"}'

    5. The user user01 should now exists in our database, because the document related with him was successfully created. Let’s check if this is true:

      nosql@couchdb:~$ curl -X POST http://localhost:5984/_session -d 'name=user01&password=password01'


      nosql@couchdb:~$ curl -X GET http://admin:admin@localhost:5984/_users/org.couchdb.user:user01

    6. Restrict an access to databse. Now our database we have created previously is open for all queries

      nosql@couchdb:~$ curl http://localhost:5984/test_database_01

      To restrict it


      nosql@couchdb:~$ curl -X PUT http://localhost:5984/test_database_01/_security \
      > -u admin:admin \
      > -H "Content-Type: application/json" \
      > -d '{"admins": { "names": [], "roles": [] }, "members": { "names": ["user01"], "roles": [] } }'


      nosql@couchdb:~$ curl http://localhost:5984/test_database_01


      nosql@couchdb:~$ curl http://user01:password01@localhost:5984/test_database_01


      nosql@couchdb:~$ curl -u user01:password01 http://localhost:5984/test_database_01


      1. Create a document

        1. Add a first document

          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/"001"
          -d '{"InvoiceNumber" : "1"}'


          The response of CouchDB to request of this type contains three fields

          • ok, indicating the operation was successful.
          • id, which stores the id of the document.
          • rev, this indicates the revision id. Every time we revise (update or modify) a document a _rev value will be generated by CouchDB. If we want to update or delete a document, CouchDB expects us to include the _rev field of the revision we wish to change. When CouchDB accepts the change, it will generate a new revision number. This mechanism ensures concurrency control.

          Add second document


          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/"002"
          -d '{"InvoiceNumber" : "2"}'

        2. We can verify by getting the created document

          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001


          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/002

        3. Because in document databases keys (main keys) do not have any meaning, we can try to use random keys -- typically UUID. CouchDB can generate UUIDs for us either one-by-one or any number we want (with count argument)

          nosql@couchdb:~$ curl -X GET http://127.0.0.1:5984/_uuids


          nosql@couchdb:~$ curl -X GET http://127.0.0.1:5984/_uuids?count=3

          Now we can use a first UUID


          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/
          00be5cce05f7edf44570dc44ed00045c -d '{"InvoiceNumber" : "3"}'


        Add new fields to a documents, update existing field

        1. Now our goal is to add some new fields to the documents id=001. Updating an existing document is the same as adding again the entire document. It is not possible to add a field to an existing document. We can only write an entirely new version of the document into the database with the same document ID. Remember to retrieve the revision id of the document that is to be updated.

          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001



          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/001
          -d '{"_rev" : "1-3b6cb095dfa92ff1cf599bdeca2fcb7b", "CutomerDetails" : {}}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001


          Note that:

          • _rev was changed from 1-3b6cb095dfa92ff1cf599bdeca2fcb7b to 2-a93a9f93e0b535eeea4db046502bab38.
          • Information about an invoice number disappeared.
        2. So to have both invoice number and customer details we should now write

          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/001
          -d '{"_rev" : "2-a93a9f93e0b535eeea4db046502bab38", "InvoiceNumber":"1", "CutomerDetails" : {}}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001


        3. Again if we try to only update existing data (invoice number)

          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/001
          -d '{"_rev" : "3-b56d23516b65a3e78a287b47862f32f8", "InvoiceNumber":"111"}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001


          we will lost all our other information.


        Delete a documents

        1. Deleting document remember to include revision number, otherwise you will get misleading message: Document update conflict.

          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001



          nosql@couchdb:~$ curl -X DELETE -u user01:password01 http://127.0.0.1:5984/test_database_01/001



          nosql@couchdb:~$ curl -X DELETE -u user01:password01
          http://127.0.0.1:5984/test_database_01/001?rev=4-1a1d2137cb82523a33ae10f86346750c



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001


          A tricky part of deleting is that CouchDB doesn’t actually delete documents. Instead, it leaves tombstone with very basic information about the document. If we just use GET CouchDB returns 404 Not Found response with a following JSON data {"error":"not_found","reason":"deleted"}. However, we may retrieve document by using rev query parameter with GET request:

          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/001?
          rev=4-1a1d2137cb82523a33ae10f86346750c


        A document history

        1. Add to test_database_01 a new document with id revison_test.

          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history
        2. Apply a sequence of changes to this document

          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history -d '{"first_field" : "1"}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history



          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history -d '{"_rev" : "1-0ac8d1aec7aeb58051ce4147ad427e1d", "first_field":"2"}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history



          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history -d '{"_rev" : "2-6b40de3094d24e71533f48d5e545cbd8", "first_field":"2", "second_field":"1"}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history



          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history -d '{"_rev" : "3-19f1d56c59f13f7bed3b0c71a71f89d5", "first_field":"2", "second_field":"2"}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history



          nosql@couchdb:~$ curl -X PUT -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history -d '{"_rev" : "4-7c59492e3ecb726eb0ccede282c67816", "third_field":"1"}'



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history

        3. We can obtain a list of the revisions for a given document by adding the revs=true parameter to the request URL:

          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history?revs=true


          The returned JSON structure includes the original document, including a _revisions document that includes the revision information in next form:

          • ids (array): array of valid revision IDs, in reverse order (latest first);
          • start (number): prefix number for the latest revision.
        4. We can get extended revision history for a given document by supplying the revs_info argument to the query:

          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history?revs_info=true


          The returned document contains _revs_info document with extended revision information, including the availability and status of each revision. This array field contains objects with following structure:

          • rev (string): full revision string.
          • status (string): status of the revision. Maybe one of:
            • available: revision is available for retrieving with rev query parameter;
            • missing: revision is not available;
            • deleted: revision belongs to deleted document.
        5. Now we can get any available revision we want

          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history?rev=1-0ac8d1aec7aeb58051ce4147ad427e1d



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history?rev=2-6b40de3094d24e71533f48d5e545cbd8



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history?rev=3-19f1d56c59f13f7bed3b0c71a71f89d5



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history?rev=4-7c59492e3ecb726eb0ccede282c67816



          nosql@couchdb:~$ curl -X GET -u user01:password01 http://127.0.0.1:5984/test_database_01/
          revision_history?rev=5-28faf5526b2d5b0be53b139ae6e0c2d5


        Working example, 1.2: CouchDB basics - querying

        Querying CouchDB is not as simple as SELECT * FROM table_name relational pattern (at least when we do this a first time) so a separate section is devoted to this topic.

        In CouchDB there is a JavaScript view engine to help us create and run queries using the map-reduce paradigm. The map portion of the paradigm takes a list of values and applies the same computation to each value to produce a new refined list. The reduce operation then collapses or combines those values into a single or given number of values. For example, the total sum of all the values. The views which act as the map part of the paradigm, are defined using JavaScript functions that take a single CouchDB document as an argument. These functions determine which documents are made available through the views.

        Map-reduce paradigm is different than map-reduce used in Hadoop -- its closer to simple divide and conquer; see a map-reduce discussion at [W16, W17]


        Prepare test database

        Create a database to be used in this section. We will use the command explained in the previous section so in case of any problems / questions please refer to adequate part.


        nosql@couchdb:~$ curl -X PUT http://admin:admin@127.0.0.1:5984/test_query


        nosql@couchdb:~$ curl -X GET http://127.0.0.1:5984/_all_dbs

        Add user (user:password: user_query:query)


        nosql@couchdb:~$ curl -X PUT http://localhost:5984/_users/org.couchdb.user:user_query \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -d '{"name": "user_query", "password": "query", "roles": [], "type": "user"}'

        Test if user_query exists


        nosql@couchdb:~$ curl -X POST http://localhost:5984/_session -d 'name=user_query&password=query'


        nosql@couchdb:~$ curl -X GET http://admin:admin@localhost:5984/_users/org.couchdb.user:user_query

        Restrict an access to a database


        nosql@couchdb:~$ curl -X PUT http://localhost:5984/test_query/_security \
        -u admin:admin \
        -H "Content-Type: application/json" \
        -d '{"admins": { "names": [], "roles": [] }, "members": { "names": ["user_query"], "roles": [] } }'


        nosql@couchdb:~$ curl http://localhost:5984/test_query


        nosql@couchdb:~$ curl -u user_query:query http://localhost:5984/test_query

        Add some data to our database. Get UUIDs


        nosql@couchdb:~$ curl -X GET http://127.0.0.1:5984/_uuids?count=10

        Use UUIDs to add documents


        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46000f72 -d '{"sequence" : "0,1,2,3,4,5,6,7,8,9"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46000f72","rev":"1-5c0efafd404c4968922bb1322fadbb97"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46001024 -d '{"sequence" : "0,2,4,6,8"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46001024","rev":"1-edaa48a80dc5a1c1698e5a78e36635dc"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46001130 -d '{"sequence" : "1,3,5,7,9"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46001130","rev":"1-acf466dabcbc023d234e56d0450879fd"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46001941 -d '{"sequence" : "0,1,2,3,4"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46001941","rev":"1-eac26e7eade624d546be3d1f21a06668"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46001b9a -d '{"sequence" : "5,6,7,8,9"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46001b9a","rev":"1-6e2edc9b19057fe147ce8ff54b2780ed"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46002b47 -d '{"sequence" : "0,2,4,5,6,7"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46002b47","rev":"1-92f37ab93fb89901a3a2c0afb734afc1"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46002e30 -d '{"sequence" : "4,5,6,7,8"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46002e30","rev":"1-dd47739eb8c60e86a49363bc8dd85d36"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf46003e0a -d '{"sequence" : "0,1,2,7,8,9"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf46003e0a","rev":"1-905ffa5c8d4f193e451c411cd844099c"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf460047e7 -d '{"sequence" : "0,1,4,5,8,11"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf460047e7","rev":"1-0d0301bf8d1ef3d4f38bf10d433dd8d4"}
        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/
        5fdee5f865c9a13e03a32aaf4600531c -d '{"sequence" : "11,12"}'
        {"ok":true,"id":"5fdee5f865c9a13e03a32aaf4600531c","rev":"1-44c242f92eae744346cabcf8db7f92be"}


        A short introduction to CouchDB views

        In CouchDB, the documents contained in a database are accessed by view which acts as a window into our data. A view implements queries using the map-reduce paradigm. It consists of mapper and reducer functions that are used to generate, based on a single document, an ordered list of key-value pairs. Both keys and values can be any valid JSON.

        Views are the primary tool used for querying and reporting on CouchDB documents. There are two different kinds of views.

        • Permanent views Permanent views are stored inside special documents called design documents, and can be accessed via an HTTP GET request to the URI /<DBNAME>/<DOCID>/<VIEWNAME>, where <DOCID> has the prefix _design/ so that CouchDB recognizes the document as a design document, and <VIEWNAME> has the prefix _view/ so that CouchDB recognizes it as a view. Other words, the views can be retrieved at an URL that looks like this: /<DBNAME>/_design/<DESIGN_DOC>/_view/<VIEW_NAME>.
        • Temporary views Temporary views are not stored in the database, but rather executed on demand. To execute a temporary view, we make an HTTP POST request to the URI /<DBNAME>/_temp_view, where the body of the request contains the code of the view function and the Content-Type header is set to application/json.

        Bellow there are some notes related to how views are managed and executed.

        • Document are processed by views independently -- each document independently of all other.
        • Map functions can’t modify the document, and they can’t interact with the outside world (they can’t have side effects).
        • Views (permanent views) are static in a sense that once computed for a given document, called next time return previously computed value (if only document stays unchanged). When we query the database using a view for the first time, each record in the database is run through the map function. This only happens once when the view is first run so that the database can be indexed and stored in a B-tree.
        • Any production code should not rely on temporary views as they are very expensive to compute each time they get called and they get increasingly slower the more data we have in a database.
        • By default views are not created and updated when a document is saved, but rather, when they are accessed. As a result, the first access might take some time depending on the size of our data while CouchDB creates the view.

          When we say create a view very often it means not creating a view (putting its code) but rather executing existing view on existing set of data. Using views is a two-step process. First we define a view; then we query it.
        • All views in a single design document get updated when one of the views in that design document gets queried.

        The simplest view, provided out of the box for all databases, is called _all_docs. It contains an entry for each document in the database, keyed by its string _id.

        To retrieve all the documents in the database, issue a following GET request


        nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_all_docs

        The response is a JSON object that contains an array of rows.

        Each row is an object with three fields:

        • id is the document’s _id.
        • key is the JSON key produced by the mapreduce functions.
        • value is the associated JSON value, also produced through mapreduce.

        In the case of _all_docs, the id and key fields match, but for custom views this will almost never be the case. By default, views won’t include all of each document’s content in the value returned. To retrieve all of the document’s fields, add the include_docs=true URL parameter.


        nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_all_docs
        ?include_docs=true


        Defining and querying a view


        Map function

        Defining a view (permanent view) is done by creating a special document, called design document, in a CouchDB database. The _id of this document must start with _design/. Other than that, is just a regular CouchDB document. Another requirement is to prepare the contents of the design document in a special format. Here is an example:

        This way we are defining a view viewname. The definition of the view consists of two functions: the map function and the reduce function. The latter function, a reduce function, is optional.

        A single design document can also include multiple view definitions, each identified by a unique name

        As it was stated before, to query the view viewname, we perform an HTTP GET request to the following URI:

        A view function accept a single argument: the document object. To produce results, it should call the implicitly available emit(key, value) function. For every invocation of that function, a result row is added to the view (if neither the key nor the value are undefined). emit shouldn't be identified with, known from many programming languages, return statement. A given map function may call emit one time, many times, or no times for a given document.

        The simplest map function may look like this:

        To execute this code as a temporary view we may send a POST request to the _temp_view handler. In this case, we pass in our map function as a JSON object in the request body.


        nosql@couchdb:~$ curl -X POST -u user_query:query http://127.0.0.1:5984/test_query/_temp_view \
        > -H "Content-Type: application/json" \
        > -d '{"map":"function(doc){emit(null,doc);}"}'

        We will fix it in a minute. Let's now use admin account.


        nosql@couchdb:~$ curl -X POST -u admin:admin http://127.0.0.1:5984/test_query/_temp_view \
        > -H "Content-Type: application/json" \
        > -d '{"map":"function(doc){emit(null,doc);}"}'

        Ok, it works, so let's change user_query privileges.


        nosql@couchdb:~$ curl -X PUT http://localhost:5984/test_query/_security \
        > -u admin:admin \
        > -H "Content-Type: application/json" \
        > -d '{"admins": { "names": ["user_query"], "roles": [] }, "members": { "names": [""], "roles": [] } }'

        Now we can query


        nosql@couchdb:~$ curl -X POST -u user_query:query http://127.0.0.1:5984/test_query/_temp_view \
        > -H "Content-Type: application/json" \
        > -d '{"map":"function(doc){emit(null,doc);}"}'

        To save our view code as a design document we may

        • Use direct HTTP request which is fine when code is short.
          Add view code


          nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/_design/queries \
          > -H "Content-Type: application/json" \
          > -d '{"views": {"getall": {"map": "function(doc){emit(null,doc);}"}}}'


          Verify

          nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_design/queries


          Test

          nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_design/que
          ries/_view/getall

        • Upload code from a file
          Prepare a file in any text editor we like (in this simple case bash echo command is used)

          Unfortunately when editor is used, final JavaScript code has to be transformed into ugly and unreadable oneline string. In other cases you will see

          Instead of

          we have to make file contents as below


          nosql@couchdb:~$ echo '{"views": {"getall2": {"map": "function(doc){emit(null,
          doc);}"}}}' > view_code.js
          nosql@couchdb:~$ cat view_code.js
          {"views": {"getall2": {"map": "function(doc){emit(null,doc);}"}}}

          Upload it to CouchDB, and save it as design document to create the view. Note that in our case rev was used to prevent document update conflict


          nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/_design/que
          ries -d @view_code.js
          {"error":"conflict","reason":"Document update conflict."}
          nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/_design/que
          ries?rev=1-41627f2abce159e7bf77199fa165053f -d @view_code.js
          {"ok":true,"id":"_design/queries","rev":"2-f485fe127204b857cdd0e874acf946d8"}

          Verify

          nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_design/queries

          Test


          nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_design/que
          ries/_view/getall

          As we can see, previously defined getall view has been removed which is correct as in CoucgDB every update creates a new document.


          nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_design/que
          ries/_view/getall2


        Reduce function

        Reduce function operate on the output of the map function (also called the map result or intermediate result). As its name states, a reduce function reduces the input values (the list that the map function produces) to a smaller output value to get finally a single value. If a view has a reduce function, it is used to produce aggregate results for that view.

        Here is an example of a reduce function:

        Reduce function takes three arguments: key, values, and rereduce. The first argument, key, is an array of tuples—two element arrays containing the key emitted by the mapper and the _id of the document that produced it. The second argument, values, is an array of values corresponding to the keys. The third argument, rereduce, is a boolean value that will be true if this invocation is a rereduction. That is, rather than being sent keys and values that were emitted from mapper calls, this call is sent the products of previous reducer calls. In this case, the key parameter will be null.

        That is why, reduce functions must handle two cases

        • When rereduce is false, i.e.

          • key will be an array whose elements are arrays of the form [key,id], where key is a key emitted by the map function and id is that of the document from which the key was generated.
          • values will be an array of the values emitted for the respective elements in keys.
        • When rereduce is true

          • key will be null.
          • values will be an array of values returned by previous calls to the reduce function.

        Reduce functions should return a single value, suitable for both the value field of the final view and as a member of the values array passed to the reduce function.

        Good introduction to views can be found at [W18]

        In our case we may want to count the number of 1's we have in all of our documents (see also Built-In Reduce Functions). The following code will do the job

        and the same in "ugly" form

        When uploaded


        nosql@couchdb:~$ curl -X PUT -u user_query:query http://127.0.0.1:5984/test_query/_design/que
        ries?rev=3-f70cdf0e279392acca16723e388aa6a8 -d @view_code.js

        and tested


        nosql@couchdb:~$ curl -X GET -u user_query:query http://127.0.0.1:5984/test_query/_design/queries/_view/count

        returns result as expected.