Sometimes the phrase “dark data” is used in a narrow sense to describe data that an organisation has collected but which has not yet been used to gain understanding or insight. The data might have been used for an immediate operational purpose (e.g. adding up the total cost of items in a shopping basket to decide how much to charge the customer; or noting how many steps you have taken today), and then been automatically stored in a database, without being subjected to any deeper analysis. Automatic sensor data, of which a great deal is collected and most is not subsequently analysed, is a prime example of this form of dark data. Another example is data which might need to be retained for possible future audit operations, such as by tax authorities, but might otherwise not be looked at.

Such data are dark because, although they exist in the database, they are unexamined. It’s possible that they contain information which would change your understanding or alter the way you run your organisation. Donald Rumsfeld famously distinguished between known unknowns and unknown unknowns, the first two DD-types described in my book. He also mentioned known knowns, but perhaps the form of dark data consisting of data you have but have not looked at constitutes unknown knowns. It’s rather nice to complete the quartet!

There has been a rise in interest in the potential of such data, based on the promise of data mining, the discipline of finding unexpected, novel, and valuable or interesting information in large data sets. The questions arising from this particular kind of dark data, data which are available in principle but which you have not examined, are the same as for all variants of dark data: What are you missing? Have you misunderstood something because you have not seen all of the relevant data? Are you going astray because you have an incomplete picture? And so on.

A cautionary note is also appropriate. While it is certainly possible that large unexamined collections of data might contain something valuable, even critical for the effective running of an organisation, they may not. Enthusiasm for this kind of exercise is sometimes based on the misunderstanding captured in the Manure Heap Theorem. This says that the probability of finding a gold coin in a heap of manure tends towards 1 as the size of the heap tends to infinity. And this theorem is clearly false. Size does not guarantee value, and unexamined data do not necessarily contain anything of value or relevance.
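
One way to see why the theorem fails is a small numerical sketch. It rests on an assumption that is mine rather than part of the theorem's usual statement: each item in the heap is independently a gold coin with some fixed probability. The function name `prob_at_least_one_coin` and the parameter `p_coin` are purely illustrative. Under this simple model the probability of finding a coin approaches 1 only when `p_coin` is greater than zero; if whatever built the heap never puts coins in, no amount of extra manure helps.

```python
def prob_at_least_one_coin(n_items: int, p_coin: float) -> float:
    """Probability that a heap of n_items holds at least one gold coin.

    Assumes (hypothetically) that each item is independently a coin
    with probability p_coin.
    """
    return 1.0 - (1.0 - p_coin) ** n_items


if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):
        no_coins = prob_at_least_one_coin(n, 0.0)     # heap source never includes coins
        rare_coins = prob_at_least_one_coin(n, 1e-3)  # coins are rare but possible
        print(f"n={n:>9}: p_coin=0 -> {no_coins:.4f}, p_coin=0.001 -> {rare_coins:.4f}")
```

The analogy with unexamined data is that the size of the collection matters only if the process that generated it could have captured something relevant in the first place.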