(C) Alec Muffett's DropSafe blog.
Author Name: Alec Muffett
This story was originally published on alecmuffett.com. [1]
License: CC-BY-SA 3.0.[2]
How (not) to deal with missing data: An economist’s take on a controversial study
2024-02-21 00:00:00
Gary Smith
Nearly 100 years ago, Muriel Bristol refused to drink a cup of tea that had been prepared by her colleague, the great British statistician Ronald Fisher, because Fisher had poured milk into the cup first and tea second, rather than tea first and milk second. Fisher didn’t believe she could tell the difference, so he tested her with eight cups of tea, half milk first and half tea first. When she got all eight correct, Fisher calculated the probability that a random guesser would do as well – which works out to 1.4%. He soon recognized that the results of agricultural experiments could be gauged in the same way – by the probability that random variation would generate the observed outcomes.
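The 1.4% is simple counting. Assuming the standard design, in which the taster knows that four cups are milk-first and four are tea-first, a random guesser must pick which four are which, and only one of the C(8,4) = 70 possible selections is entirely correct. A quick check of the arithmetic:

    # Probability that a random guesser identifies all eight cups correctly,
    # assuming they know exactly four of the eight are milk-first: 1 / C(8, 4)
    from math import comb

    p = 1 / comb(8, 4)
    print(f"{p:.3%}")   # about 1.429%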
If this probability (the P-value) is sufficiently low, the results might be deemed statistically significant. How low? Fisher recommended we use a 5% cutoff and “ignore entirely all results which fail to reach this level.”
His 5% solution soon became the norm. Not wanting their hard work to be ignored entirely, many researchers strive mightily to get their P-values below 0.05.
For example, a student in my introductory statistics class once surveyed 54 classmates and was disappointed that the P-value was 0.114. This student’s creative solution was to triple the data by assuming each survey response had been given by three people instead of one: “I assumed I originally picked a perfect random sample, and that if I were to poll 3 times as many people, my data would be greater in magnitude, but still distributed in the same way.” This ingenious solution reduced the P-value to 0.011, well below Fisher’s magic threshold.
Ingenious, yes. Sensible, no. If this procedure were legitimate, every researcher could multiply their data by whatever number is necessary to get a P-value below 0.05. The only valid way to get more data is, well, to get more data. This student should have surveyed more people instead of fabricating data.
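A small simulation makes the problem concrete. The sketch below uses invented numbers rather than the student’s actual survey responses: copying each observation three times leaves the sample mean untouched but shrinks the standard error by a factor of roughly the square root of three, so the P-value falls even though no new information has been collected.

    # Invented illustration (not the student's survey data): duplicating each
    # response three times mechanically shrinks the P-value.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.2, scale=1.0, size=54)   # hypothetical survey scores

    # Honest test on the original 54 responses
    t1, p1 = stats.ttest_1samp(sample, popmean=0.0)

    # The "ingenious" fix: pretend each response was given by three people
    tripled = np.tile(sample, 3)
    t2, p2 = stats.ttest_1samp(tripled, popmean=0.0)

    print(f"n = 54:  p = {p1:.3f}")
    print(f"n = 162: p = {p2:.3f}")   # smaller, only because the data were copied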
I was reminded of this student’s clever ploy when Frederik Joelving, a journalist with Retraction Watch, recently contacted me about a published paper written by two prominent economists, Almas Heshmati and Mike Tsionas, on green innovations in 27 countries during the years 1990 through 2018. Joelving had been contacted by a PhD student who had been working with the same data used by Heshmati and Tsionas. The student knew the data in the article had large gaps and was “dumbstruck” by the paper’s assertion these data came from a “balanced panel.” Panel data are cross-sectional data for, say, individuals, businesses, or countries at different points in time. A “balanced panel” has complete cross-section data at every point in time; an unbalanced panel has missing observations. This student knew firsthand there were lots of missing observations in these data.
The student contacted Heshmati and eventually obtained spreadsheets of the data he had used in the paper. Heshmati acknowledged that, although he and his coauthor had not mentioned this fact in the paper, the data had gaps. He revealed in an email that these gaps had been filled by using Excel’s autofill function: “We used (forward and) backward trend imputations to replace the few missing unit values….using 2, 3, or 4 observed units before or after the missing units.”
That statement is striking for two reasons. First, far from being a “few” missing values, nearly 2,000 observations for the 19 variables that appear in their paper are missing (13% of the data set). Second, the flexibility of using two, three, or four adjacent values is concerning. Joelving played around with Excel’s autofill function and found that changing the number of adjacent units had a large effect on the estimates of missing values.
Joelving also found that Excel’s autofill function sometimes generated negative values for variables that cannot, even in principle, be negative. For example, Korea is missing R&Dinv (green R&D investments) data for 1990-1998. Heshmati and Tsionas used Excel’s autofill with three years of data (1999, 2000, and 2001) to create data for the nine missing years. The imputed values for 1990-1996 were negative, so the authors set these equal to the positive 1997 value.
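The mechanics are easy to reproduce. The sketch below performs a backward linear-trend fill of the sort Excel’s autofill produces, with made-up numbers standing in for the Korean series: extrapolating a rising trend backward soon dips below zero, which is what forced the clamping in the first place.

    # Backward linear-trend fill of the kind Excel's autofill performs.
    # The numbers are invented for illustration, not the paper's Korean series.
    import numpy as np

    years_known = np.array([1999, 2000, 2001])
    values_known = np.array([16.0, 23.0, 30.0])    # hypothetical, rising trend

    # Fit a straight line through the observed years, extrapolate backward
    slope, intercept = np.polyfit(years_known, values_known, 1)
    years_missing = np.arange(1990, 1999)          # the nine missing years
    imputed = slope * years_missing + intercept
    print(dict(zip(years_missing, imputed.round(1))))  # 1990-1996 come out negative

    # Clamping negatives to the first positive imputed value (here, 1997's)
    # hides the impossibility rather than fixing it
    floor = imputed[years_missing == 1997][0]
    imputed = np.where(imputed < 0, floor, imputed)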
Overall, the missing observations in this data set are not evenly distributed across countries and years. IPRpro (an index of intellectual property rights strength) is missing 79% of its data because there are only observations every four, five, or six years. Another variable, EDUter (government expenditures on tertiary education as a percentage of GDP), was said to be a “crucial determinant of innovativeness” but is missing 34% of its data.
Some countries are missing data for several consecutive years. For example, the variable MKTcap is the market capitalization of listed domestic companies measured as a percentage of gross domestic product (GDP). The MKTcap data end for Finland in 2005, Denmark in 2004, and Sweden in 2003, requiring 13, 14, and 15 years of imputed data, respectively. The MKTcap data for Greece don’t begin until 2001 (requiring 12 years of imputed data). Italy has MKTcap data for only 1999 through 2008. The authors imputed the values for the nine years before and the 10 years after this interval.
The most extreme cases are where a country has no data at all for a given variable. The authors’ solution was to copy and paste data for another country. Iceland has no MKTcap data, so all 29 years of data for Japan were pasted into the Iceland cells. Similarly, the ENVpol (environmental policy stringency) data for Greece (with six years imputed) were pasted into Iceland’s cells, and the ENVpol data for the Netherlands (with 2013-2018 imputed) were pasted into New Zealand’s cells. The WASTE (municipal waste per capita) data for Belgium (with 1991-1994 and 2018 imputed) were pasted into Canada’s cells. The United Kingdom’s R&Dpers (R&D personnel) data were pasted into the United States (though the 10.417 entry for the United Kingdom in 1990 was inexplicably changed to 9.900 for the United States).
The copy-and-pasted countries were usually adjacent in the alphabetical list (Belgium and Canada, Greece and Iceland, Netherlands and New Zealand, United Kingdom and United States), but there is no reason an alphabetical sorting gives the most reasonable candidates for copying and pasting. Even more troubling is the pasting of Japan’s MKTcap data into Iceland and the simultaneous pasting of Greece’s ENVpol data into Iceland. Iceland and Japan are not adjacent alphabetically, suggesting this match was chosen to bolster the desired results.
Imputation is attractive because it provides more observations and, if the imputed data are similar to the actual data, the P-values are likely to drop. In an email exchange with Retraction Watch, Heshmati said, “If we do not use imputation, such data is [sic] almost useless.”
Imputation sometimes seems reasonable. If we are measuring the population of an area and are missing data for 2011, it is reasonable to fit a trend line and, unless there has been substantial immigration or emigration, use the predicted value for 2011. Using stock returns for 2010 and 2012 to impute a stock return for 2011 is not reasonable.
Clearly, the more values are imputed, the less trustworthy are the results. It is surely questionable to use data for, say, 1999 through 2008 to impute values for 1990-1998 and 2009-2018. It is hard to think of any sensible justification for using 29 years of one country’s data to fill in missing cells for another country.
There is no justification for a paper not stating that some data were imputed and describing how the imputation was done. It is even worse to state the data had no missing observations. This paper might have been assessed quite differently – perhaps not been published at all – if the reviewers had known about the many imputations and how they were done.
Gary Smith is an economics professor at Pomona College. He has written (or co-authored) more than 100 peer-reviewed papers and 17 books, including the best-seller Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie With Statistics.
[END]
[1] URL: https://retractionwatch.com/2024/02/21/how-not-to-deal-with-missing-data-an-economists-take-on-a-controversial-study/
[2] URL: https://creativecommons.org/licenses/by-sa/3.0/