At one time, when we didn’t have much data, most of what we did have was considered either essential or very valuable indeed: accounts, legal documents, receipts, orders, medical records. You get the picture. Because we couldn’t generate, store or process much information, that which we did generate, store and process had real importance: not only was it important that the data was retained, it was also important that it was consistent for all viewers at any point in time and that it didn’t get lost.
As computers advanced, much of early computing focused on automating this manual data processing.
Then came the internet.
And with it came a vast amount of new data, much of it seemingly trivial. This seemingly trivial data falls into the Realm of Dispensable Data. The Oxford English Dictionary defines “dispensable” with several meanings, the most salient to this article being “That can be dispensed with or done without; unessential, omissible; unimportant.”
It struck me that much of the driving force behind many Big Data projects built upon NOSQL document stores is the desire to harness dispensable data and derive value from it. So what exactly constitutes “Dispensable Data”?
In my opinion, data is dispensable when it meets all of the following criteria:
- Not required for audit/forensic purposes.
- Not essential to the delivery of a service/transaction/function.
- The unit of data is so small that its loss would not significantly alter any aggregate of which it forms a part.
The answer to each of these three criteria may differ depending on who is asking the question and why they are asking it.
Enough of the theory; here are some examples:
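The three criteria can be sketched as a simple predicate. This is only an illustration: the `DataAssessment` type, its field names and the two sample assessments are hypothetical, and the answers are supplied by whoever is asking the question, which is why the same data can be judged differently in different contexts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAssessment:
    """One asker's answers to the three criteria (hypothetical type)."""
    needed_for_audit: bool        # required for audit/forensic purposes?
    essential_to_service: bool    # essential to a service/transaction/function?
    loss_alters_aggregates: bool  # would losing one unit noticeably change an aggregate?


def is_dispensable(a: DataAssessment) -> bool:
    # Data is dispensable only when ALL three criteria are met,
    # i.e. when none of the three flags is set.
    return not (a.needed_for_audit
                or a.essential_to_service
                or a.loss_alters_aggregates)


# The same data, assessed by two different (assumed) askers.
marketing_view = DataAssessment(False, False, False)
compliance_view = DataAssessment(True, False, False)

print(is_dispensable(marketing_view))   # True
print(is_dispensable(compliance_view))  # False
```

Because the result depends entirely on the answers passed in, the predicate makes the context-dependence explicit rather than hiding it.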
- Internet search engines – The results of any search engine fully meet these criteria: an audit of what the results are for a given search term is not required; if a particular result were omitted from the results, it is unlikely that the user would notice, as he/she would be unaware of what the full list should be; and if one entry were missing from the list of search results (there being a lot of results), it is unlikely the count of results would differ by a noticeable number.
- Tweets – If the client is a business analysing Tweets and looking for trending subjects for marketing purposes, it may need to store the Tweet data for subsequent analysis, but is it essential that a Tweet does not get lost? Unlikely. Being non-transactional, the second criterion is also met. And certainly, an individual Tweet would be so small in relation to all Tweets that the loss of a single Tweet is highly unlikely to affect any aggregates of which it forms a part.
- Webserver state data – State information is only required for the duration of the user session. The only user of the state information for an individual web session is the end-user himself. Therefore retention beyond the session isn’t required, meeting the first criterion of auditing not being necessary. Should the session data be lost, the impact would only be to the user whose state is lost and that user would have to initiate another session. The service wouldn’t fail, only the session, so although the impact on the user concerned might be that he needs to re-enter data, the service as a whole would not otherwise be impacted. Session data is also not normally of use in any aggregates, thereby meeting the final criterion.
As you can see from the examples above, the advent of the internet and, in particular, the advent of social media has introduced a new category of dispensable data.
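The third criterion, as applied in the Tweets example, can be illustrated with a little arithmetic. The figures below are invented purely for illustration (one million Tweets, of which 25,000 mention some topic); they are not measurements of anything.

```python
def share(matching: int, total: int) -> float:
    """Fraction of items that match a topic."""
    return matching / total


# Hypothetical volumes, chosen only to show the scale of the effect.
total_tweets = 1_000_000
matching_tweets = 25_000

before = share(matching_tweets, total_tweets)

# Worst case for this aggregate: the single lost Tweet was a matching one.
after = share(matching_tweets - 1, total_tweets - 1)

relative_change = abs(before - after) / before
print(f"share before: {before:.6f}")
print(f"share after:  {after:.6f}")
print(f"relative change: {relative_change:.2e}")
```

With these assumed volumes, the relative change in the aggregate is a few parts in a hundred thousand, far below anything a trend analysis would notice.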
Dispensable data is the sweet spot for many of the NOSQL document stores, many of which don’t offer transactional consistency. Instead, they offer what the vendors of these products call “eventually consistent” models. Furthermore, because much of this data is free-form text (blogs, tweets, etc.), there is very little processing that can be usefully applied to this data. This also suits document stores.
Unfortunately, there will be many a keen developer who, in his/her rush to embrace these technologies, will choose a document store for data that is not dispensable. When a non-transactional data store is used where transactions are required, it is only a matter of time before real money is lost.
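As a rough sketch of how that loss can happen, consider two clients that each read a balance, add a deposit, and write the result back with nothing making the read-modify-write atomic. The in-memory dictionary below is only a stand-in for a hypothetical non-transactional store; the key name and amounts are made up, and the interleaving is written out by hand rather than produced by real concurrency.

```python
# Hypothetical non-transactional key-value store holding an account balance.
store = {"account:42": 100}
key = "account:42"


def read(k: str) -> int:
    return store[k]


def write(k: str, value: int) -> None:
    store[k] = value


# Two clients each deposit 50. Both read before either writes.
seen_by_a = read(key)        # client A sees 100
seen_by_b = read(key)        # client B also sees 100

write(key, seen_by_a + 50)   # A writes 150
write(key, seen_by_b + 50)   # B writes 150, overwriting A's deposit

final_balance = store[key]
expected_balance = 100 + 50 + 50

print(final_balance)                     # 150
print(expected_balance - final_balance)  # 50 has gone missing
```

A transactional store would either serialise the two updates or reject one of them; here the second write silently wins and one deposit disappears.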