Home

Latest Post

The NOSQL Landscape

There are a lot of products today marketing themselves as NOSQL. However the more one digs deeply into the supported features of each, the more startling are the differences. Unlike RDBMSs, the feature set of NOSQL databases is radically different from one to another: In areas such as transaction support and scope, programmatic interfaces and underlying storage, the differences are so drastic that picking the wrong product can spell curtains for your project. So with that said, here’s my technical breakdown of the NOSQL landscape. Wikipedia categorises NOSQL platforms into four distinct types (Columnar, Document, Key-Value and Multi-Model):

Columnar

Columnar databases are primarily designed for the storage of structured, tabular data. Unlike in a conventional RDBMS, instead of the data being stored as rows on the underlying storage media, data in one (or more) columns are stored together on the underlying media. The advantage of columnar storage is mainly around compression – by storing the data in a column contiguously, higher compression ratios are possible on that data than is the case if the same tabular data were stored in rows. This is because different and optimal compression algorithms can be chosen for each column/group of columns stored together, taking into account both the underlying data type and the statistical distribution of that data. For example, if a column contains a large amount of duplicated data, then algorithms that include run-length encoding can be used to offer massive compression. Because high compression ratios can be achieved, queries that would require large amounts of IO require less physical IO at the cost of increased CPU utilisation in decompressing the required data during query execution. But it’s not all good news – columnar data structures are more difficult both to update and to delete data from as the compression has to be recalculated upon modification. This means that columnar storage is highly suited to immutable, historic data used for analytical purposes, but unsuitable for the storage of transactional data.

I have to differ with the Wikipedia entry for Columnar systems here – some columnar storage products are federated systems with processing distributed among multiple nodes, but others aren’t. Does that make them NOSQL? Or is it that they have to support an interface other than SQL? In that case, does MDX fit that definition?

Examples of columnar databases include HP Vertica (pure columnar), SAP HANA (In-memory columnar), Microsoft Analysis Services Tabular Mode (In-memory columnar).

Document

Document databases are primarily designed for the storage of semi-structured and unstructured data. The unit of storage is the document, analogous to a file but stored within the structure of the document database. What the document database provides over a stand-alone file-system is the ability to create indexes on the documents’ contents and a query language that facilitates the retrieval of documents according to query criteria. Furthermore because document databases are often distributed over multiple machines (i.e. sharded), this capability enables them to scale horizontally. Because the document store enables the storage of any document, it is essentially schema-less. This enables the storage of a wide variety of documents without, on the face of it, any necessity to modify the document store. However the reality is that in order to get the best query performance from a document store, it will be necessary to assess any index definitions to ensure that attributes used for query criteria are indexed. Similarly if a document store is indexing data that is not used for query criteria, the maintenance of these indexes will place an unnecessary burden on the underlying document store. Document stores offer varying features, for example MarkLogic supports transactions spanning multiple documents whereas MongoDB doesn’t. Other features commonly associated with RDBMSs are also likely to be missing in document stores, for example referential integrity, rules, triggers, etc. This means that such functionality, where required would have to be implemented outside of the database layer by the development team. A further consideration when choosing a document store is the underlying storage format, for example, MarkLogic is primarily designed for the storage of XML documents; MongoDB by contrast is primarily designed for the storage of JSON documents. Therefore if you want to avoid having to code the translation of your documents to/from the format of the underlying document store, ensure you choose a document store that can store your document format. Because there are limited features available for the enforcement of rules in the underlying document store, rules tend to be enforced by the application layer. Furthermore, because the document stores are schema-less, if a new version of the application requires a change to the schema, then either all the existing data has to be transformed to the new schema, or the code has to handle data both in the old format and the new. This latter approach can lead to complex, difficult to maintain code in the long term. Document stores are suitable for the storage of varying semi-structured data where transactional consistency is of low importance, for example blog storage.

Examples of document databases include MongoDB, MarkLogic, Elastic, Apache CouchDB, Microsoft Azure DocumentDB

Key-Value

Key-value stores vary widely in their structure and capabilities. There are a large range of products that offer key-value structures. What they have in common is that they are mostly non-relational and as such necessitate flat, denormalized data structures accessible by a key, that may or may not be ordered. Since Key-Value stores are non-relational, the data stored should be simple, tabular data, without too much repeated data. This makes Key-Value stores ideal for storage of large amounts of tabular, logging data, which is commonly simple and immutable.

Examples of Key-Value stores include Apache Cassandra, Apache HBase, Apache Accumulo, Google BigQuery

Graph

Graph stores are designed for the storage of complex network structures, where the relationships between the nodes stored are complex, variable and cannot be fully defined up-front. Examples of ideal usage of graph stores would be tracking the relationships and interactions between people, posts and messages on a social networking site, or tracking the interaction between individuals, places and resources in a database storing information on criminal activity. Graph databases are particularly suited to queries requiring what would otherwise be considered iterative queries, such as the expansion of a network.

Examples of Graph databases include Neo4j, Teradata Aster SQL-GR, Microsoft Horton, VelocityGraph, MapGraph, Ontotext GraphDB, BrightStarDB, InfiniteGraph

Multi-Model

Multi-model stores combine two or more of the above structures (and possibly RDBMS structures) in a single store enabling querying across the store types in the same database engine.

Examples of Multi-model database engines include Teradata Aster Database (Row, Columnar), Teradata Database (Row, Columnar), OrientDB (Graph-Document), Apache Hive (Document and Columnar), Oracle Spatial and Graph (Graph, Spatial)

%d bloggers like this: