Ideally, whenever we have to make a decision we collect all the relevant data and condense it down so that we can choose well. However, the notion of “all the relevant data” is an elusive one: very often, perhaps most often, much of the relevant data cannot be collected, so we are forced to base our decisions on partial information. The data we cannot use to inform our decisions are dark data.

Dark data comes in many forms, including blanks in a data table, cases we are unaware exist, unexamined data gathering metaphorical dust in our databases, and data we are unable to link to data we have. My book Dark Data: Why What You Don’t Know Matters describes fifteen types of dark data, and even that list is certainly not comprehensive. The Covid-19 outbreak provides examples of several kinds.

Perhaps the starting point for the dark data perspective on the Covid pandemic is the question “could we have seen it coming?”. The answer, as is always the case with such situations (including the 2008 financial crash and the 2001 World Trade Centre atrocity), is that some people did. There is always a diversity of perspectives and expectations about the path the future will take. In particular, given the number of teams around the world working on epidemic modelling, and based on the ideas described in the book The Improbability Principle, there are bound to be some groups which predicted events of just the kind we are experiencing. Given this, perhaps the question should be why more attention was not paid to those who predicted such a pandemic. Were these unrealistic predictions? After all, there are always doomsayers and conspiracy theorists with extreme views. Or were they based on solid data, as with, for example, the exercise described in [1], whose organisers, last October, “convened a group of experts to work through what would happen if a global pandemic suddenly hit the world’s population. The disease at the heart of our scenario was a novel and highly transmissible coronavirus.” In any case, even if a prediction is based on solid data, it is unrealistic to expect governments to respond to all such warnings. After all, there are many other warnings being made, ranging from the impact of climate change, through plastics pollution and asteroid collision, to terrorism, refugee crises, and nuclear war. Governments have to balance probabilities and consequences, and all within the context of uncertainty about how humans will respond as circumstances change.

On a smaller scale, perhaps the most obvious kind of dark data in the present context is the lack of widespread testing to determine who is infected. From the operational perspective, test results are needed to decide who to quarantine (and treat) so as to curb the spread of the virus. From the statistical perspective, on the other hand, understanding how many people have contracted the virus and how it is spreading is vital if effective control policies are to be introduced. This means that random testing can be a useful tool from the statistical perspective, but not from the operational perspective, which will be focused on contact tracing and testing front-line workers. Failure to gain a good estimate of the number of infected cases can adversely impact the quality of the statistical models. “Adversely impact” can mean that it is difficult to know which of several models is a more accurate representation of the underlying truth. And a further complication, another kind of dark data, is that a single negative test is not the end of the matter: one can test negative and contract the illness immediately afterwards, so repeated testing is necessary.
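
To make the statistical side concrete, here is a minimal sketch (in Python, with purely invented numbers) of why random testing matters: it supports a simple prevalence estimate with a normal-approximation confidence interval. The function name and figures are illustrative assumptions, not any official protocol.

```python
import math

def estimate_prevalence(positives, sample_size, z=1.96):
    """Point estimate of prevalence from a random sample, with an
    approximate 95% normal-approximation confidence interval."""
    p_hat = positives / sample_size
    se = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, (max(0.0, p_hat - z * se), min(1.0, p_hat + z * se))

# Illustrative: 150 positives among 10,000 randomly selected people.
p, (low, high) = estimate_prevalence(150, 10_000)
print(f"Estimated prevalence {p:.2%}, 95% CI ({low:.2%}, {high:.2%})")
```

Contact tracing, by contrast, deliberately samples a non-random, high-risk subset of the population, which is exactly why it serves the operational purpose but cannot support an estimate like this.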

Another, more subtle, kind of dark data arises because tests are not perfect. Tests can report a positive identification on people who do not have the disease and, conversely, can report a negative identification on people who do have it. If a test is poor enough, such false positives and false negatives can give a misleading impression of the disease prevalence. Worse, from an operational perspective, false negatives can fail to slow an outbreak, since they release infected people back into the community. These two kinds of errors must be balanced against each other. At an extreme, it is easy to reduce the number of false negatives to zero by classifying everyone as positive, but this is clearly useless. The choice of a suitable balance has been the subject of a great deal of research, but the bottom line is that the balance depends on non-statistical factors, such as the relative seriousness with which one regards the two kinds of misclassification. In the case of coronavirus, misclassifying someone as having the disease when they do not is less serious than the converse: the former means an unnecessary quarantine period, while the latter could have grave consequences for the individual and for wider society.
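
How that balance might be struck can be sketched in a few lines. The cost figures below are invented purely for illustration; in reality, quantifying the relative seriousness of the two errors is the hard, non-statistical part.

```python
def expected_cost(sensitivity, specificity, prevalence, cost_fn, cost_fp):
    """Expected misclassification cost per person tested."""
    fn_rate = (1 - sensitivity) * prevalence        # infected but cleared
    fp_rate = (1 - specificity) * (1 - prevalence)  # healthy but quarantined
    return cost_fn * fn_rate + cost_fp * fp_rate

# Invented costs: releasing an infected person is taken to be 20 times
# as costly as quarantining a healthy one.
for sens, spec in [(0.99, 0.80), (0.90, 0.95), (0.80, 0.99)]:
    cost = expected_cost(sens, spec, prevalence=0.05, cost_fn=20.0, cost_fp=1.0)
    print(f"sensitivity {sens:.2f}, specificity {spec:.2f}: expected cost {cost:.3f}")
```

Under these assumed costs the middle sensitivity/specificity profile comes out best; change the cost ratio and the ranking changes, which is precisely the point.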

False positive and false negative rates are all very well, but they are not really what we want to know. What we really want to know is what proportion of those classified as cases are in fact not cases, and what proportion of those classified as non-cases are in fact cases. These can be calculated from the false positive and false negative rates, but the calculation also requires the prevalence of the disease in the population, and that changes over time.
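
The calculation alluded to here is just Bayes’ theorem. A short sketch, again with illustrative figures, shows how strongly the answer depends on prevalence:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values via Bayes' theorem."""
    tp = sensitivity * prevalence              # true positives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# The same test looks very different as the prevalence changes over time.
for prev in (0.001, 0.01, 0.10):
    ppv, npv = predictive_values(0.95, 0.95, prev)
    print(f"prevalence {prev:.1%}: PPV {ppv:.1%}, NPV {npv:.2%}")
```

With 95% sensitivity and specificity but a prevalence of one in a thousand, fewer than 2% of positive results are true cases; the same test can be informative at the peak of an outbreak and nearly useless early on.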

These problems mean that it is sometimes better to rely on less extensive but more reliable figures: death rates are generally more reliable than infection rates, since false positives and negatives are rare. Be warned, however: it is the essence of dark data that its presence (absence?) is not always obvious. Coronavirus death rates can undercount the number of deaths if some are classified as due to other causes (e.g. a heart attack which was in fact precipitated by the virus). Likewise, there have been suggestions that the death toll in Wuhan is an undercount: a report in The Times of 30th March (“Coronavirus around the world: New bailouts and restrictions as death toll tops 30,000”) said the number of urns for ashes sold and returned to family members far exceeded the official figure.

Scenario planning and simulation can play a major role in understanding disease outbreaks and their progression. This is a positive use of dark data, generating data which might have been, on the assumption that the model encoded in the simulation is correct.
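
As a flavour of what such simulations look like at their very simplest, here is a toy discrete-time SIR (susceptible-infected-recovered) model; all of the parameter values are illustrative assumptions, not estimates for Covid-19.

```python
def sir(population, beta, gamma, initial_infected, days):
    """Toy discrete-time SIR model. beta is the transmission rate,
    gamma the recovery rate (the reciprocal of the infectious period)."""
    s, i, r = population - initial_infected, float(initial_infected), 0.0
    infected_over_time = []
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        infected_over_time.append(i)
    return infected_over_time

# One "world that might be": beta = 0.3, ten-day infectious period (R0 = 3).
curve = sir(population=1_000_000, beta=0.3, gamma=0.1, initial_infected=10, days=180)
peak = max(curve)
print(f"Peak of {peak:,.0f} simultaneous infections on day {curve.index(peak)}")
```

Each run of such a model generates one set of data which might have been, conditional on the assumed parameters being right.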

A related use of dark data “which might have been” arises in sensitivity analysis, where either models or data are perturbed slightly to see how large an effect is produced. One of the challenges with pandemics is that, because case numbers grow exponentially, with a compounding effect, later numbers of cases depend in a highly sensitive way on slight changes to earlier values. In fact, this is what underlies the social distancing strategy: reducing the number of contacts each infected individual has can dramatically reduce the number of cases further down the line. In the present context, this can allow health care facilities to cope where otherwise they would be swamped.
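
That sensitivity is easy to demonstrate. In the toy calculation below (illustrative numbers only), stretching the doubling time from three days to five, as a distancing policy might, cuts the one-month case count by more than an order of magnitude.

```python
def cases_after(days, doubling_time, initial_cases=100):
    """Cases after a given number of days of unchecked exponential growth."""
    return initial_cases * 2 ** (days / doubling_time)

# A modest stretch of the doubling time transforms the picture a month later.
for doubling_time in (3.0, 4.0, 5.0):
    print(f"doubling time {doubling_time:.0f} days: "
          f"{cases_after(30, doubling_time):,.0f} cases after 30 days")
```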

A more pessimistic but similar use of perturbation, generating (dark) data which might have been, is to explore extreme situations.

A particular dark data challenge in managing epidemics is common to other areas of public policy: there is always a delayed response to a change. Adopt one public policy now and you have to wait (in the present case around a couple of weeks) to see what effect it has. Until that point, the data are dark. If the data, when they become visible, show that the policy is not having the desired effect, then alternatives will be adopted. In the present case, the exponential, compounding nature of the outbreak means that the evidence that a policy is not working will suddenly become very apparent, with large numbers of cases (or deaths) appearing, whereas earlier it was impossible to discern whether the policy was working. This can lead to unjustified criticisms, such as “it is blindingly obvious that the policy is not working, so why did you not detect this earlier?” The answer is that while it might be “blindingly obvious” now, it was not obvious, or perhaps even detectable, earlier.
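
A toy calculation again makes the point. In the sketch below (all figures invented), a policy adopted on day 10 does not alter the growth rate until day 24, so data gathered in between can say nothing about whether it is working.

```python
def case_curve(days, growth_before, growth_after, policy_day, lag=14, start=100.0):
    """Toy case curve: a policy adopted on policy_day only changes the daily
    growth factor `lag` days later, so early data cannot reveal its effect."""
    cases, series = start, []
    for day in range(days):
        factor = growth_after if day >= policy_day + lag else growth_before
        cases *= factor
        series.append(cases)
    return series

series = case_curve(days=40, growth_before=1.25, growth_after=1.05, policy_day=10)
print(f"Day 20 (ten days after the policy, still dark): {series[19]:,.0f} cases")
print(f"Day 40 (slowdown now visible):                  {series[39]:,.0f} cases")
```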

In this context, one must always bear in mind the contingent nature of science. Something fundamental to science is that it is evidence-based, and if the evidence, the data, changes (that is, if more hitherto dark data become visible), then the scientific models and conclusions, as well as decisions and actions based on them, can change. This is something which media pundits sometimes forget. Unlike religion, science does not claim to provide absolute truth, but merely a good model or explanation in the light of all current data. This means, of course, that it can be wrong. Put another way, we should never lose sight of the fact that scientific models are just that: models. They are not the reality, and could indeed misrepresent that reality, though the more data a model explains, the more justification we have for treating it as a good representation of reality.

One consequence of this is that decisions based on the data available at the time the decisions have to be made cannot be faulted if they were sensible in view of that data, even if in retrospect they turn out to be wrong. On the other hand, if they were clearly mistaken decisions given the data on which they were based, then criticism might be legitimate.

All of this means that the sentiment expressed in a quote “from a source” in The Sunday Times of 29th March (Tim Shipman: “Inside No 10 – sickness, fear and now isolation”), which said “The big issue is who to blame for the fact that the strategy flipped”, is unforgivable. The strategy flipped because more evidence became available and it became clear that the best thing to do was change direction. Not flipping the strategy in the face of new evidence that a change of direction was needed is what should have raised the question of who to blame.

Behind all this is the challenge that what is happening is fundamentally non-stationary: things are changing over time. It means that writing about the coronavirus epidemic is difficult as the figures change daily. A model which seemed perfectly adequate yesterday might well fail today.

Counterfactuals are another kind of dark data: the “what would have happened had this not happened”. Comparing counterfactuals with actual outcomes can be used to explore the effectiveness of policies, but since, by definition, counterfactuals are unobserved, this poses technical challenges.

Incidentally, it is the role of politicians and managers, not scientists, to make decisions about how to act in the face of the evidence. Scientists can describe how the world works, as well as counterfactual worlds showing the likely impact of different policies. But politicians and managers have to balance the consequences and costs of the different actions. In stark terms in the present context, this is a balance between the number of deaths and the longer term damage to the economy. The difference between the roles of the scientists and the politicians is clearly brought to life by the nature of the statements they make. A scientist might suggest that a social isolation policy will need to be maintained for a long period if the number of infections and deaths is to be reduced, while a politician might be more inclined to err on the side of optimism.

Related to this is another unjustified criticism which has been voiced, referring to an inadequate stock of masks, ventilators, and so on. Fundamental economics means we cannot have vast stocks of such things sitting around gradually gathering dust, just in case they are needed, any more than we in the UK can have fleets of snowploughs lined up in underground garages just in case we get a foot of snow. It’s a question of an unknown (dark data) future. It is sensible for Norway to have fleets of snowploughs, where future snowfalls can be predicted with considerable certainty, but not for us, even if the UK grinds to a halt for a day or two in the rare event that we do get a centimetre of snow. There is an opportunity cost to creating and storing equipment which must be balanced against alternative uses of the money.

Another kind of dark data arises in terms of unintended, and unexpected, consequences. An obvious one, to which many have drawn attention, is the extent of the damage inflicted on the economy by the hard policy of social isolation and lockdown. Quite clearly, companies are going to the wall and people will lose their jobs in the long term, but quantifying this is very difficult. One consequence of this kind of dark data is that it makes choosing policies that much harder. It is analogous to the false positive/negative problem of diagnostic tests on a grand scale: one needs to balance the costs of the different policies, and this is tough if those costs are not known (though it has been done in the case of diagnostic systems; see, for example, https://www.hmeasure.net/).

Other possible consequences include people dying from other conditions as medical assessments and treatments unrelated to coronavirus are delayed or cancelled, lack of exercise leading to more heart attacks later on, an increased suicide rate due to depression arising from enforced isolation, increased rates of domestic abuse, increased rates of burglary of unattended commercial properties, increases in both birth and divorce rates, and a dramatic decline in income to charities. It also seems likely that foreign travel will be dramatically curtailed for some time to come, and in some areas this could have serious adverse consequences: not just for the airline and cruise industries, and for the many places where tourism is the major industry, but in other areas as well. To take just one example, according to Universities UK [2], in 2017 international students coming to the UK contributed some £25 billion to the UK economy. Of the total of 485,645 international students in the UK in 2018/19, some 120,385 came from China [3].

This note has described some of the ways in which dark data affects models of the Covid-19 pandemic and our understanding of its course. There are others. To the extent that dark data are not taken into account in an analysis, the results, the conclusions, and possibly the decisions based on them are at risk.

[1] https://www.politico.com/news/magazine/2020/03/07/coronavirus-epidemic-prediction-policy-advice-121172

[2] https://www.universitiesuk.ac.uk/news/Pages/International-students-now-worth-25-billion-to-UK-economy—new-research.aspx

[3] https://www.studying-in-uk.org/international-student-statistics-in-uk/