Can we trust the data on Covid-19?

Note: This article is NOT concerned with the correctness of the various models that attempt to predict how Covid-19 may spread in the future. That, rightly, is the domain of epidemiologists. Our interest here is in the trustworthiness of the data on confirmed cases that have been reported so far.

Most of us will be familiar by now with the numbers and graphs showing how Covid-19 infections are growing throughout the world. We are exhorted to isolate ourselves to help ‘flatten the curve’. We anxiously watch the daily numbers rise and our country’s leaders are eager to seize on any sign that the steepness of the curve is reducing.

But how trustworthy is the data we are seeing? Where does it come from? How is it collected? Are different health jurisdictions really comparable?

Trust in data comes down to half a dozen key questions:

Source – Where did it come from and who is responsible for it?

Standards – Are there standard definitions of the terms?

Method – What method was used to create or collect it?

Sustainable – Is the lineage of the source, methods and standards (recognising that they may evolve) available over the long term?

Issues – Are there known issues with the data quality?

References – Who else relies on this data?
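As a rough illustration, the six questions above could be recorded as a simple checklist for each data publisher, so that the gaps are easy to see. The class name `TrustChecklist` and its fields are hypothetical, introduced only for this sketch; the example answers are taken from what Guardian Australia states about its own charts.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TrustChecklist:
    """One answer per trust question; None means 'not yet answered'."""
    source: Optional[str] = None
    standards: Optional[str] = None
    method: Optional[str] = None
    sustainable: Optional[str] = None
    issues: Optional[str] = None
    references: Optional[str] = None

    def unanswered(self) -> List[str]:
        # List the questions that still have no answer recorded.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Illustrative entry only: the answers paraphrase the publisher's own notes.
guardian = TrustChecklist(
    source="Guardian Australia analysis of state and territory data",
    method="Collates daily state and territory releases, plus unspecified analysis",
    issues="Most recent day is usually based on incomplete data",
)

print(guardian.unanswered())
```

Anything the checklist reports as unanswered is a question still to be put to the publisher before relying on its numbers.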

Let’s take a look at some of these questions in relation to Covid-19 data.

Many readers will have accessed Covid-19 data through their favourite media channels. For example, Guardian Australia publishes “live statistics” on Covid-19 cases and deaths for Australia, UK, US, and the whole world. The data is presented in an easy-to-understand set of graphs and total counts, giving readers ready access to the latest information.

Where does the Guardian source the data from? The source for Australian data is clearly stated as “Guardian Australia analysis of state and territory data”. That is, the Guardian collates the daily releases of updated data from the health agencies of each state and territory, combined with some unspecified “analysis”. The Guardian warns readers that the “most recent day is usually based on incomplete data” and always states the date when the chart was last updated. This information tells us something about the method the Guardian employs, but is it enough?

Another media outlet publishing Covid-19 data for Australia is the Mandarin, which publishes similar charts and infographics on a dedicated website. In this case, volunteers collate the data, again from figures announced by the state and territory governments. The state and territory authorities note on their websites that the number of confirmed cases for previous days is updated as further cases are notified to them by laboratories and physicians. It is not clear whether the Guardian or the Mandarin make corresponding updates to their data for previous days.

The federal health department requires all health authorities to report cases of infectious diseases, including Covid-19, through the National Notifiable Diseases Surveillance System. The department collates these notifications of individual cases and publishes updated national and state/territory totals each day.

Since all three of the above data publishers appear to source their data directly from state and territory health authorities, we would expect the numbers to match, wouldn’t we? Unfortunately, this is not quite the case. For example, the daily totals of new confirmed cases for two recent days are:

Guardian Australia – 383 (26 March 2020), 385 (27 March 2020)

The Mandarin – 379 (26 March 2020), 371 (27 March 2020)

Commonwealth Department of Health – 360 (26 March 2020), 355 (27 March 2020)

The numbers from the Department of Health are markedly lower than those from the two media outlets, apparently because of a different data lineage. It appears that the Department receives notifications of individual cases of all infectious diseases from the state and territory authorities, then aggregates these into totals. The media outlets, on the other hand, base their numbers on those announced directly by the states and territories. The smaller difference between the two media outlets is less significant but harder to explain.
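To put a size on these gaps, the short sketch below computes the difference between any two publishers for a given day, using only the figures quoted above. The function name `gap_versus` and the dictionary layout are assumptions made for this illustration.

```python
# Daily new confirmed cases as quoted in this article.
daily_new_cases = {
    "Guardian Australia": {"2020-03-26": 383, "2020-03-27": 385},
    "The Mandarin": {"2020-03-26": 379, "2020-03-27": 371},
    "Commonwealth Department of Health": {"2020-03-26": 360, "2020-03-27": 355},
}


def gap_versus(baseline, other, day):
    """Return (absolute gap, percentage gap) of `other` relative to `baseline`."""
    base = daily_new_cases[baseline][day]
    diff = daily_new_cases[other][day] - base
    return diff, round(100 * diff / base, 1)


department = "Commonwealth Department of Health"
for outlet in ("Guardian Australia", "The Mandarin"):
    for day in ("2020-03-26", "2020-03-27"):
        diff, pct = gap_versus(department, outlet, day)
        print(f"{outlet} vs Department on {day}: {diff:+d} cases ({pct:+.1f}%)")
```

On these figures the media outlets sit roughly 4.5% to 8.5% above the Department's totals, while the two outlets differ from each other by 4 cases on 26 March and 14 cases on 27 March.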

Johns Hopkins University has an impressive and apparently authoritative GitHub repository of global data and a map-based dashboard. This repository is relied on by many media outlets, including Guardian Australia. The sources and lineage of the data JHU presents are complex (for example, 15 data sources are acknowledged, and at times they present conflicting data). This example illustrates that the data we see on a media outlet’s website may have a long and convoluted lineage from the original notifications of Covid-19 cases in each country. Can we trust that the lineage has not messed with the data?

Misaligned and changing daily cut-off times also play a part in the differing daily totals in Australia: one state agency may announce its updated case total at 3:00pm while another announces its total at 8:00pm. In both cases the total is for cases confirmed in the previous 24 hours. The federal Department of Health’s data, on the other hand, is based on the date a confirmed case is notified, which may be a day or two, or even longer, after the case was confirmed. Data for earlier dates may be updated as new notifications are received.
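The effect of the cut-off time can be shown with a small sketch. The timestamps below are invented purely for illustration (they are not real case data), and the helper names `reporting_day` and `daily_totals` are hypothetical. The same five confirmations are counted twice, once with a 3:00pm cut-off and once with an 8:00pm cut-off.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def reporting_day(confirmed_at, cutoff_hour):
    """Date of the first daily cut-off at or after the confirmation time."""
    cutoff = confirmed_at.replace(hour=cutoff_hour, minute=0, second=0, microsecond=0)
    if confirmed_at > cutoff:
        # Confirmed after today's cut-off, so it is announced tomorrow.
        cutoff += timedelta(days=1)
    return cutoff.date()


# Invented confirmation times, for illustration only.
confirmations = [
    datetime(2020, 3, 26, 9, 30),
    datetime(2020, 3, 26, 14, 0),
    datetime(2020, 3, 26, 16, 45),
    datetime(2020, 3, 26, 19, 15),
    datetime(2020, 3, 26, 22, 0),
]


def daily_totals(cutoff_hour):
    """Count confirmations per announcement date for a given cut-off hour."""
    return Counter(reporting_day(t, cutoff_hour) for t in confirmations)


day = date(2020, 3, 26)
print("3:00pm cut-off:", daily_totals(15)[day], "cases announced on", day)
print("8:00pm cut-off:", daily_totals(20)[day], "cases announced on", day)
```

With these made-up times, the 3:00pm cut-off attributes two cases to 26 March and the 8:00pm cut-off attributes four, even though the underlying cases are identical. Counting by notification date, as the federal department does, would shift the totals again.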

Finally, there is the question of standards. What does ‘confirmed case’ mean? Does each jurisdiction use the same definition of the term, or one that differs, even if only slightly? The World Health Organization sets out “case definitions” to guide health authorities in global surveillance of the disease. The WHO defines a confirmed case of Covid-19 as a “person with laboratory confirmation of COVID-19 infection, irrespective of clinical signs and symptoms” and it encourages health authorities to state the definition they use when publishing their data.

The Commonwealth Department of Health definition of a confirmed case is “a person who tests positive to a validated specific SARS-CoV-2 nucleic acid test or has the virus identified by electron microscopy or viral culture, at a reference laboratory”. This is an unambiguous definition that clearly excludes people who have Covid-19 symptoms but have not received a positive test result, or who have been tested by an unauthorised method. Many state and territory websites do not state that they conform to this definition of confirmed case, but perhaps we can assume they do?

Australian confirmed cases and those in other countries may not be comparable if the definition varies. Do all countries conform to the WHO case definition? We should be cautious about trusting international comparisons unless we understand all the relevant definitions.

The Department of Health notes that the definition has evolved, and may evolve further, as the outbreak has progressed. Each Covid-19 weekly report states the current definition of both confirmed case and suspect case. These evolving definitions make comparison over time somewhat problematic.

In conclusion, do these small differences in the number of confirmed cases really matter? On one level, the progression of the disease is evident (and troubling) no matter which number you look at. The numbers do matter, however, when a commentator or a politician claims that “the curve is flattening” based on a perceived reduction in new cases compared to a week ago.

Quite apart from epidemiological reasons why a reduction in cases may or may not be significant, we need to ask questions of the data. Are they comparing the data from a media outlet for last week against the data published by the federal health department for this week? Has the method of data compilation changed? Has the definition of ‘confirmed case’ changed? Are the health practitioners and authorities overwhelmed, causing them to be less timely in notifying cases?

These are good questions that guide us to look at data critically to decide whether we trust it. By learning to ask these questions of the Covid-19 data, we can learn to ask similar questions to understand the trustworthiness of the data we use in our usual daily work.

In another blog post, we’ll look at how the rate of Covid-19 testing affects the trustworthiness of the confirmed cases data.

Author: Graham Wilson

Graham is a business architect with 30 years' experience with Australian and New Zealand government agencies. He is skilled at steering a path between business and IT. Graham has been responsible for guiding business representatives on the architecture of significant government initiatives.
