Two California physicians recently used data they collected in their private urgent-care facilities to extrapolate COVID-19 illness and mortality rates for the state. A number of news outlets reported these new “findings” as fact, simply because they didn’t understand the data or how to interpret them. This is problematic for a number of reasons, not the least of which is that studies have shown that even when misinformation is later corrected, it is not forgotten; in fact, the correction can strengthen belief in the original misinformation. In other words, the damage has been done.
This is why it is key to be able to understand and interpret data, especially in an age when information is easily accessible, easily manipulated, and easily shared. There are many aspects of data that must be understood in order to determine if a study or conclusion is warranted. While these aspects are not specifically mentioned in the NGSS, their essence can be found in the standards, specifically in the Science and Engineering Practice (SEP) known as Analyzing and Interpreting Data. The following information explains why each aspect of understanding and interpreting data is important, how the data used can be misleading (or just plain wrong) if these aspects are ignored, and what we can do to ensure that the information we are receiving is based on accurate data analysis.
Self-Selecting Data Sets
High School SEP: Consider limitations of data analysis (e.g., measurement error, sample selection) when analyzing and interpreting data.
The two California physicians used COVID-19 data they collected in their urgent care facilities to make their determinations about the general population. But people who go to an urgent care facility have reason to believe they are ill. This is therefore a self-selecting group rather than a cross-section (also known as a random sampling) of the population.
More simply put, this is the equivalent of asking people who are leaving an opera performance whether they like opera, or asking people walking into an optician’s office if they’re having trouble with their eyesight—they probably wouldn’t be there if they didn’t. Let’s say you did either of these surveys and found that 9 out of 10 people said yes—would it then be right to conclude that 90% of the human race likes opera and wears glasses? No, because your data set was self-selecting. All you could conclude is that 90% of people who went to this particular opera at this particular location like opera, and 90% of people who went to this particular optician on that particular day think they have difficulties with their eyes.
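The effect of a self-selecting sample can be demonstrated with a toy simulation. Everything here is hypothetical: the 10% true rate of opera fans, the attendance probabilities, and the sample sizes are invented for illustration, not drawn from any real survey.

```python
import random

random.seed(42)

# Hypothetical population: assume only 10% of people genuinely like opera.
population = [random.random() < 0.10 for _ in range(100_000)]

# Who shows up at an opera performance? Fans attend far more often
# (assumed probabilities: 50% for fans, 2% for everyone else).
attendees = [likes for likes in population
             if random.random() < (0.50 if likes else 0.02)]

random_sample = random.sample(population, 1_000)  # a true cross-section
exit_survey = random.sample(attendees, 1_000)     # a self-selected group


def pct(sample):
    """Fraction of surveyed people who like opera."""
    return sum(sample) / len(sample)


print(f"True rate in population: {pct(population):.0%}")
print(f"Random-sample estimate:  {pct(random_sample):.0%}")
print(f"Exit-survey estimate:    {pct(exit_survey):.0%}")
```

The random sample lands near the true 10%, while the exit survey reports a large majority of opera fans, purely because of who walked through the door.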
Incomplete Data Sets
Middle School SEP: Consider limitations of data analysis (e.g., measurement error), and/or seek to improve precision and accuracy of data with better technological tools and methods (e.g., multiple trials).
The California doctors used recently collected data to make these calculations and compared them to flu data from previous years. But COVID-19 is an ongoing pandemic that we are only a few months into, while the flu data the doctors used for comparison come from completed flu events—whole seasons’ worth of data. Logic tells us that data collected partway through a pandemic cannot be used to show the pandemic “isn’t as bad” as a past flu season for which data collection is complete.
Otherwise, using the same logic, you could argue that the birth rate for 2020 is lower than 2019’s because fewer babies have been born this year. While it is true that fewer babies have been born so far in 2020, comparing data from half a year to a whole year to conclude that the birth rate has decreased is clearly misleading at best.
Similarly, using the same logic, you could declare a winner in an election with only 62% of precincts reporting results, as the Iowa Democratic Party did in February, which Twitter users were quick to point out was problematic.
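The birth-rate comparison above can be made concrete with toy numbers. All of the counts below are invented for illustration; the point is only that a raw count from a partial period always looks like a “drop” next to a full period, while a per-month rate puts the two on equal footing.

```python
# Hypothetical birth counts (illustrative only, not real data).
births_2019_full_year = 120_000  # 12 months of data
births_2020_so_far = 61_000      # only 6 months of data
months_elapsed_2020 = 6

# Naive comparison: half a year of data vs. a whole year.
naive_change = births_2020_so_far - births_2019_full_year
print(f"Naive 'drop': {naive_change:+,}")  # looks like a huge decline

# Fair comparison: normalize both periods to births per month.
rate_2019 = births_2019_full_year / 12
rate_2020 = births_2020_so_far / months_elapsed_2020
print(f"2019: {rate_2019:,.0f} births/month")
print(f"2020: {rate_2020:,.0f} births/month")  # actually slightly higher
```

With these made-up numbers, the naive count shows a deficit of tens of thousands of births, yet the monthly rate in 2020 is actually a bit higher than in 2019.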
Correlation without Causation
Elementary School SEP: Analyze and interpret data to make sense of phenomena, using logical reasoning, mathematics, and/or computation.
Middle School SEP: Distinguish between causal and correlational relationships in data.
All over the internet you can find theories about all manner of things, including that 5G is somehow linked to COVID-19. The reason this correlation has been erroneously raised is simple (obviously, too simple): there was an increase in 5G availability around the time we started experiencing COVID-19. We’ve also had an increase in Doctor Dolittle movies, the amount of snowfall in New England, the number of hours of sunlight in the northern hemisphere, and the number of baby boys named Kylo in the United States, and yet these phenomena were all mercifully spared the accusations of causing the pandemic because, as a scientist would put it, correlation does not indicate causation. Just because two things share a trend does not mean that one of them caused the other.
Without considering causation, you can form lots of illogical arguments by just looking at data sets. For instance, you could argue that plant growth determines how much it will rain, rather than the other way around. Or that the number of galoshes sold is caused by the number of umbrellas sold, rather than that both are determined by the weather. Or, as seen in the graph below, that the number of movies Nicolas Cage stars in determines how many people will drown in swimming pools that year, rather than that they are coincidentally similar and are otherwise completely unrelated.
Just because two data sets seem to correlate does not mean one caused the other, or even that they are related at all.
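It is easy to manufacture an impressive-looking correlation from two unrelated series. The sketch below computes the Pearson correlation coefficient from scratch for two entirely made-up yearly series (the film counts and drowning counts are invented, not the real figures behind the famous Nicolas Cage chart); both happen to drift upward together, so the coefficient comes out high even though neither causes the other.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Two made-up yearly series that happen to trend together.
films_per_year = [1, 2, 2, 3, 1, 2, 3, 4, 2, 3]
drownings_per_year = [10, 14, 13, 17, 9, 13, 16, 20, 12, 16]

r = pearson_r(films_per_year, drownings_per_year)
print(f"r = {r:.2f}")  # a very strong correlation, yet no causal link
```

A coefficient near 1 tells you the two series move together; it says nothing about whether either one causes the other.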
Knowing This, How Should We Judge Data?
When we are given new information, we need to ask ourselves a series of questions.
Is this a reputable source?
Checking the credentials of the people presenting the information and determining what evidence they have, whether the evidence was peer-reviewed, and where it was published would help remove misinformation before it is ever relayed to others.
Do the data make sense?
You know more than you give yourself credit for. Often, our gut instincts are simply alerting us that something isn’t quite right and that we should look more closely.
If you were told, for instance, that bicycling to work twice a week and carpooling twice a week would reduce your carbon footprint by the same amount, would this make sense to you? Or would you, without having any actual data, know logically that a person does not produce the same level of emissions that a car produces? Consider statements critically, follow logic, and do some research to confirm the assertions being made.
Are these things comparable?
Data sets can be related but not comparable simply because they are not treated the same way or in a way that makes sense for the hypothesis being tested. In order to determine if data sets can be compared, you must ask questions about how the data were collected, analyzed, and presented to the public.
For instance, let’s say a study is determining how to allocate additional funding for public transportation. The study says that 100,000 people use public transportation in City X, while only 90,000 people use public transportation in City Y, and therefore City X should receive all the funding. But what are the population sizes of these cities? If 100,000 people is 33% of the population of City X, but 90,000 people is 90% of the population of City Y, does it make sense to compare the number of people instead of the percentage of the population? What about the time periods? If these values are the daily averages, and City X’s average is for the whole year while City Y’s average is for a month with lots of holidays (such as December), does it make sense to compare these numbers at all?
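The arithmetic behind the City X/City Y example is worth spelling out. The populations below are hypothetical values chosen to match the percentages in the example (100,000 is 33% of 300,000; 90,000 is 90% of 100,000); the cities themselves are fictional.

```python
# Hypothetical figures matching the example above (not real cities).
city_x_riders, city_x_population = 100_000, 300_000
city_y_riders, city_y_population = 90_000, 100_000

# Raw counts suggest City X "wins"...
print(city_x_riders > city_y_riders)  # True: 100,000 > 90,000

# ...but the share of residents who rely on transit tells the opposite story.
x_share = city_x_riders / city_x_population  # about 33%
y_share = city_y_riders / city_y_population  # 90%
print(f"City X: {x_share:.0%} of residents use transit")
print(f"City Y: {y_share:.0%} of residents use transit")
```

Which number matters depends on the question being asked; the mistake is comparing a count when the hypothesis is really about a proportion.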
Is there an evident cause?
Remember, even after answering all the questions above, correlation does not determine causation. It is important to look at the data sets critically and to determine whether there is any logical reason to think that one caused the other.
Let’s say you’re presented with a graph showing a high correlation of two data sets. Can you logically determine whether one of the data sets is causing the other? Is it possible, for instance, that spiders are angered by children’s abilities to spell long words? (No) Or that the fear of venomous spiders causes children to stay home and memorize words? (Probably not) Or are they both the result of another cause, or even sheer coincidence? (Yes)
The Importance of Understanding Data
It is only by asking these questions that we are better able to judge the validity of claims and decide for ourselves what the data actually tell us. Research has shown that even scientists can reach false conclusions in their own studies because “statistical tests are misused, misinterpreted, or misunderstood.”
It is important that people, including students, learn to understand and interpret data for themselves because statistics can be misleading. As Mark Twain is reported to have said, “Facts are stubborn things, but statistics are more pliable.”