Understanding NoSQL Database Types: Document

I've been covering different aspects of NoSQL in my running series here, which is aimed at beginners who want an entry-level view of databases. Also known as "non-SQL" or "non-relational", this category of database allows for the storage and querying of "schema-free" data, i.e., data that is modeled by some means other than the tabular model of relational databases.

NoSQL systems emerged to support features that are inherently difficult or unavailable in SQL's model, such as horizontal scalability. The term itself dates to the late 1990s and came into wide use around 2009. The need grew sharply as the internet matured and managing unstructured or semi-structured data at massive volumes became pivotal, although there is debate at deeper levels on whether SQL vs. NoSQL is a valid line of demarcation in DevOps generally.

The four most common NoSQL database types are: 1) key-value 2) document 3) graph 4) column. (There are many more subsets, including 7 well-known ones.) For more on my CACM 'NoSQL series', you can visit these links:

Definition: Document Databases

Document-oriented databases store data as self-describing documents, most often JSON, rather than in columns and rows. The format feels native to developers because it is widely supported by popular NoSQL systems and by several SQL systems as well.

This NoSQL database type is considered one of the four essential ones, so it is no surprise that the document model is used by many of the most popular non-relational systems, such as MongoDB, Cosmos DB, DocumentDB, Elasticsearch, OrientDB, RavenDB and SimpleDB, and is also supported by PostgreSQL (which is first and foremost a SQL database).

Documents in and of themselves are at the root of document-oriented databases. How a document is defined will vary depending on the specific data store implemented. Documents are, however, typically used for encapsulating data into a standard encoding format. Common formats are XML, YAML, JSON, and the binary form, BSON.

Documents are roughly equivalent to objects in their function. Organizationally, there is no need for a set schema; for instance, the fields inside each document are optional. A single store can also hold different types of documents, and some stores let you use different encodings for different documents. Here is an example of a document encoded in JSON:

{
  "FirstName": "Alex",
  "Address": "9 King's Cross",
  "Hobby": "databases"
}

Here is a similar contact document, this time encoded in XML:

<contact>
  <lastname>Williams</lastname>
  <address>
    <type>Mobile</type>
    <street1>9 King's Cross</street1>
    <city>London</city>
  </address>
</contact>
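To make the idea of "one record, several encodings" concrete, here is a minimal sketch using only Python's standard library. The `contact` record, its field names, and the helper functions `to_json` and `to_xml` are assumptions made for illustration; they are not taken from any particular document database.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical contact record; field names are chosen for illustration only.
contact = {
    "firstname": "Alex",
    "lastname": "Williams",
    "address": {"street1": "9 King's Cross", "city": "London"},
}


def to_json(record: dict) -> str:
    """Encode the record as a JSON document."""
    return json.dumps(record, indent=2)


def to_xml(record: dict, root_tag: str = "contact") -> str:
    """Encode the record as an XML document, nesting dicts as child elements."""

    def build(parent: ET.Element, data: dict) -> None:
        for key, value in data.items():
            child = ET.SubElement(parent, key)
            if isinstance(value, dict):
                build(child, value)  # nested object becomes a nested element
            else:
                child.text = str(value)

    root = ET.Element(root_tag)
    build(root, record)
    return ET.tostring(root, encoding="unicode")


json_doc = to_json(contact)
xml_doc = to_xml(contact)
print(json_doc)
print(xml_doc)
```

Both strings carry the same information; only the encoding differs, which is why a store can choose whichever format suits it.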

The two documents describe the same contact, but their fields differ. When you add new information to one document, there is no need to update every other document or to make sure they all share the same structure. (One difference between SQL and NoSQL is that a relational table gives every row every column, so a record with no value still carries an empty field, whereas a document simply omits fields it does not use.)
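The contrast between an empty relational field and an omitted document field can be sketched as follows. This is an illustrative example only: the `contacts` table, the second contact "Sam", and the `Twitter` field are invented for the demonstration, and plain Python dictionaries stand in for documents.

```python
import sqlite3

# Relational side: every row has every column, so a missing value is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (first_name TEXT, address TEXT, hobby TEXT)")
conn.execute(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    ("Alex", "9 King's Cross", "databases"),
)
conn.execute("INSERT INTO contacts (first_name) VALUES (?)", ("Sam",))
rows = conn.execute(
    "SELECT first_name, address, hobby FROM contacts ORDER BY rowid"
).fetchall()

# Document side: each document carries only the fields it needs.
documents = [
    {"FirstName": "Alex", "Address": "9 King's Cross", "Hobby": "databases"},
    {"FirstName": "Sam"},
]
# Adding a new field to one document leaves the others untouched.
documents[1]["Twitter"] = "@sam_example"

field_sets = [sorted(doc) for doc in documents]
print(rows)
print(field_sets)
```

In the table, Sam's row still has `address` and `hobby` columns holding NULL (`None` in Python); in the document list, those fields are simply absent.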

The way in which each document is structurally composed is usually referred to as that document's content, which can be referenced during querying and editing. For scalability, an update only needs to touch the documents concerned rather than the whole database. When adding, changing, querying, or deleting information, you address a document by its unique ID. That identifier is often a simple string, a number in a series, or a complete path. During queries, the documents themselves are searched: data is extracted directly from the documents, rather than from columns inside your database.
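The operations described above (add, change, query, and delete by unique ID, plus searching document content) can be sketched with a toy in-memory store. The `DocumentStore` class and its method names are hypothetical and are not modeled on the API of MongoDB or any other product; the path-style ID `contacts/alex` is likewise just an example.

```python
import uuid
from typing import Any, Optional


class DocumentStore:
    """Toy in-memory document store (hypothetical; no real product's API)."""

    def __init__(self) -> None:
        self._docs = {}  # maps document ID -> document (a dict)

    def insert(self, doc: dict, doc_id: Optional[str] = None) -> str:
        """Store a document under the given ID, or under a generated one."""
        doc_id = doc_id or uuid.uuid4().hex
        self._docs[doc_id] = dict(doc)
        return doc_id

    def get(self, doc_id: str) -> Optional[dict]:
        """Return a copy of the document, or None if the ID is unknown."""
        doc = self._docs.get(doc_id)
        return dict(doc) if doc is not None else None

    def update(self, doc_id: str, changes: dict) -> None:
        """Change only the addressed document; others are untouched."""
        self._docs[doc_id].update(changes)

    def delete(self, doc_id: str) -> None:
        self._docs.pop(doc_id, None)

    def find(self, **criteria: Any) -> list:
        """Search the documents themselves for matching field values."""
        return [
            dict(doc)
            for doc in self._docs.values()
            if all(doc.get(key) == value for key, value in criteria.items())
        ]


store = DocumentStore()
alex_id = store.insert({"FirstName": "Alex", "Hobby": "databases"}, doc_id="contacts/alex")
sam_id = store.insert({"FirstName": "Sam", "Hobby": "chess"})  # generated ID
store.update(alex_id, {"City": "London"})
matches = store.find(Hobby="databases")
```

Note that `find` inspects each document's fields directly, which mirrors the point that queries run against document content rather than table columns.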

Advantages

Schema free: you are relatively unrestricted in how you structure and format documents, which suits massive data volumes in varying structural states, particularly in environments that anticipate ongoing and rapid changes to varied data.
No foreign keys: documents can exist independently, without the need for relationships, and are described in open formats such as JSON and XML.
Low maintenance / high speeds: you add a complex document once, with minimal maintenance thereafter; some stores also offer built-in versioning, which means fewer conflicts as your documents grow in size and complexity.

Disadvantages

1. Limits of consistency checking

Because documents are not forced to have relations with one another, and each can have varying fields, fewer consistency checks are enforced. For instance, in a book database containing author collections (with each collection containing the associated documents, or books), it would be possible to bring up entries not associated with any author. There could also be duplications of author information.

This isn't an immense drawback in most contexts where you would use a document database, but where the strictest relational consistency audits are required (such as in accounting), it is a serious obstacle.
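Since the store does not enforce these relationships, checks of this kind tend to move into application code. The sketch below shows one way such an audit could look for the book and author example; the sample data, the `author_id` field, and both helper functions are assumptions invented for illustration.

```python
from collections import Counter

# Hypothetical sample data: plain dicts stand in for stored documents.
authors = [
    {"_id": "a1", "name": "A. Writer"},
    {"_id": "a2", "name": "A. Writer"},  # same author entered twice
    {"_id": "a3", "name": "B. Novelist"},
]
books = [
    {"_id": "b1", "title": "First Book", "author_id": "a1"},
    {"_id": "b2", "title": "Second Book", "author_id": "a9"},  # no such author
    {"_id": "b3", "title": "Third Book"},  # no author field at all
]


def find_orphan_books(books: list, authors: list) -> list:
    """Return IDs of books whose author reference is missing or dangling."""
    known_ids = {author["_id"] for author in authors}
    return [book["_id"] for book in books if book.get("author_id") not in known_ids]


def find_duplicate_author_names(authors: list) -> list:
    """Return author names that appear in more than one document."""
    counts = Counter(author["name"] for author in authors)
    return sorted(name for name, count in counts.items() if count > 1)


orphans = find_orphan_books(books, authors)
duplicates = find_duplicate_author_names(authors)
print(orphans, duplicates)
```

A relational schema would reject the dangling reference through a foreign-key constraint and could reject the duplicate through a unique constraint; here both slip in and must be found after the fact.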

2. Weaker atomicity

Document databases often settle for eventual consistency. In the relational model, each piece of data can be modified in one place: a single command (such as deleting or updating a row) takes effect for all subsequent read queries, which inherit the change.

In comparison, changing data that is spread across several collections of documents requires a separate query per collection. Unless the store can group those queries into a single transaction, they do not succeed or fail as a unit, which violates atomicity requirements.
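The risk can be illustrated with a small simulation. In this sketch, two dictionaries stand in for two collections that each hold a copy of an author's name, and a deliberately raised exception stands in for a failure between the two writes. All names here (`rename_author`, `SimulatedCrash`, the sample data) are hypothetical.

```python
# Two "collections" that both store the author's name (denormalized data).
authors = {"a1": {"name": "A. Writer"}}
books = {"b1": {"title": "First Book", "author_name": "A. Writer"}}


class SimulatedCrash(Exception):
    """Stands in for a network error or process failure between writes."""


def rename_author(author_id: str, old_name: str, new_name: str,
                  crash_between_writes: bool = False) -> None:
    # Write 1: update the authors collection.
    authors[author_id]["name"] = new_name
    if crash_between_writes:
        raise SimulatedCrash("failed after the first write")
    # Write 2: update every book that embeds a copy of the name.
    for book in books.values():
        if book.get("author_name") == old_name:
            book["author_name"] = new_name


try:
    rename_author("a1", "A. Writer", "A. Author", crash_between_writes=True)
except SimulatedCrash:
    pass

# The first write happened, the second did not: the copies now disagree.
inconsistent = authors["a1"]["name"] != books["b1"]["author_name"]
print(inconsistent)
```

Stores that offer multi-document transactions can wrap both writes so that they commit or roll back together; without that, the application has to detect and repair the half-finished update itself.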

3. Security weaknesses

Nearly half of industries today — including the public services, manufacturing, healthcare, education, retail and utilities sectors — are exposed throughout the year, due to at least one serious exploitable web app vulnerability (Source: SecMag).

Popular Use Cases

Let's run through a few document database use cases to place the technology in a concrete context. Overall, the common thread, from Coca-Cola to eBay, is giving developers more flexibility and cost-effective, feature-rich capabilities, thereby improving the availability and performance of apps:

MongoDB:

Forbes used MongoDB to reduce its build time by 58%, leading to a 28% increase in subscriptions. Forbes put this down to being able to add new features more easily, with simpler, smoother handling of data in varied structural states.
Toyota's developers could work at faster speeds via the more natural JSON document encoding format; spending more time working on development than modelling data.

Amazon DocumentDB:

Amazon itself used Amazon DocumentDB for its full development team, to increase productivity and agility. Standout features were ad hoc queries, aggregations, and nested indexes — all inside of a fully managed database.
The BBC used DocumentDB to store/query its data existing inside of multiple feeds, compiling them into individual customer feeds. Standout features that enabled this were a fully managed database service, with default backups, and high availability/durability.

ArangoDB:

Oxford University lowered hospital costs and the need for attendance, also improving test results — by developing a web app interface for remotely assessing cardiopulmonary disease.
FlightStats was able to gather divergent data (weather, airport delays, flight status, and reference data) into a single standard — which boosted the accuracy/availability of predictive meta-analytics via queries.

Conclusion

Document databases are particularly useful in app development. With the Apple App Store and Google Play Store recording a combined 33.6 billion app downloads in Q1 of 2020 (Source: techjury), document stores are, and will continue to be, a critical need for the majority of large businesses in most, if not all, industries.

With this increased demand, there has been a spike in the popularity of document stores on the market. And while the main weakness of document stores is a relative lack of strong relational links, this becomes a strength when dealing flexibly with large data volumes.

Alex Williams is a full-stack developer with over 15 years of experience, and the owner of Hosting Data UK.

Communications of the ACM (CACM) is now a fully Open Access publication.
