Data dredging

A humorous example of a result produced by data dredging, showing a coincidental correlation between the number of letters in Scripps National Spelling Bee's winning word and the number of people in the United States killed by venomous spiders

Data dredging (also known as data snooping or p-hacking)[1] is the misuse of data analysis to find patterns in data that can be presented as statistically significant, which dramatically increases the risk of false positives while understating it. This is done by performing many statistical tests on the data and reporting only those that come back with significant results.[2]

The process of data dredging involves testing multiple hypotheses against a single data set by exhaustive search: for combinations of variables that might show a correlation, for example, or for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.
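The exhaustive search described above can be illustrated with a minimal sketch, assuming a hypothetical data set made entirely of unrelated random noise. The names `pearson_r`, `dredge_pairs`, and `columns` are invented for this illustration and do not come from any cited source. The sketch correlates every pair of variables and then picks out the strongest one, which is the selective step that makes dredging misleading.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dredge_pairs(columns):
    """Correlate every pair of columns; return results sorted by |r|, largest first."""
    results = []
    for (name_a, a), (name_b, b) in itertools.combinations(columns.items(), 2):
        results.append((name_a, name_b, pearson_r(a, b)))
    return sorted(results, key=lambda t: abs(t[2]), reverse=True)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data: 20 unrelated noise variables, 30 observations each.
columns = {f"var{i}": [rng.gauss(0, 1) for _ in range(30)] for i in range(20)}

ranked = dredge_pairs(columns)
best_a, best_b, best_r = ranked[0]
print(f"{len(ranked)} pairs examined")
print(f"strongest pair: {best_a} vs {best_b}, r = {best_r:+.2f}")
```

Twenty variables already yield 190 pairwise comparisons. Reporting only the top-ranked pair, without mentioning the other 189, would present a chance pattern as if it were a finding.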

Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of one type of mistaken conclusion: rejecting a null hypothesis that is in fact true. This level of risk is called the significance level. When large numbers of tests are performed, some produce false results of this type: by chance alone, about 5% of tested hypotheses for which the null hypothesis is true can be expected to be (erroneously) reported as statistically significant at the 5% significance level, about 1% at the 1% significance level, and so on. When enough hypotheses are tested, it is virtually certain that some will be reported to be statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain some spurious correlations, for example. Researchers using data mining techniques can easily be misled by these results if they are not cautious. The term p-hacking (in reference to p-values) was coined in a 2014 paper by the three researchers behind the blog Data Colada, which has focused on uncovering such problems in social sciences research.[3][4][5]
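The arithmetic behind the false-positive rates described above can be checked with a short simulation. The following is a sketch under stated assumptions, not a definitive procedure: every null hypothesis is true by construction, the tests are independent, and a two-sided z-test with known standard deviation stands in for whatever test a study might actually use. The function names `familywise_error_rate` and `null_p_value` are invented for this example.

```python
import math
import random
from statistics import NormalDist


def familywise_error_rate(alpha, num_tests):
    """P(at least one false positive) across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests


def null_p_value(rng, n=25):
    """Two-sided z-test p-value for a sample whose true mean really is 0."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) * math.sqrt(n)  # assumes known sigma = 1
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(1)  # fixed seed so the run is repeatable
alpha = 0.05
num_tests = 2000
p_values = [null_p_value(rng) for _ in range(num_tests)]
false_positive_rate = sum(p < alpha for p in p_values) / num_tests

print(f"share of true nulls rejected at {alpha}: {false_positive_rate:.3f}")
for m in (1, 20, 100):
    print(f"{m:>3} tests -> P(at least one false positive) = "
          f"{familywise_error_rate(alpha, m):.3f}")
```

The simulated share of rejections should land near 5%. Under the independence assumption, the chance of at least one false positive is 1 - (1 - 0.05)^m, which is roughly 64% for 20 tests and over 99% for 100 tests.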

Data dredging is an example of disregarding the multiple comparisons problem. One form is comparing subgroups without alerting the reader to the total number of subgroup comparisons examined.[6]
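One common way to respect the multiple comparisons problem, rather than disregard it, is to report the total number of comparisons and tighten the significance threshold accordingly. The sketch below uses the Bonferroni correction, a technique chosen here for illustration and not one prescribed by the cited sources. The subgroup names and p-values are invented, and `bonferroni_report` is a hypothetical helper.

```python
def bonferroni_report(p_values, alpha=0.05):
    """Return the comparison count, adjusted threshold, and which results survive."""
    m = len(p_values)        # total comparisons examined, stated openly
    threshold = alpha / m    # Bonferroni-adjusted per-comparison threshold
    survivors = {name: p <= threshold for name, p in p_values.items()}
    return m, threshold, survivors


# Hypothetical subgroup p-values, invented purely for illustration.
subgroup_p = {"age<40": 0.010, "age>=40": 0.040, "women": 0.200, "men": 0.003}

total, threshold, survivors = bonferroni_report(subgroup_p)
print(f"{total} subgroup comparisons, adjusted threshold = {threshold}")
for name, keeps in survivors.items():
    print(f"{name:>8}: p = {subgroup_p[name]:.3f} -> "
          f"{'significant' if keeps else 'not significant'}")
```

In this made-up case, the "age>=40" result would pass an unadjusted 5% threshold but not the adjusted one. The correction is conservative, so it trades some statistical power for protection against false positives.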

  1. ^ Wasserstein, Ronald L.; Lazar, Nicole A. (2016-04-02). "The ASA Statement on p-Values: Context, Process, and Purpose". The American Statistician. 70 (2). Informa UK Limited: 129–133. doi:10.1080/00031305.2016.1154108. ISSN 0003-1305.
  2. ^ [Reference "bmj02" is not defined in the source.]
  3. ^ Lewis-Kraus, Gideon (2023-09-30). "They Studied Dishonesty. Was Their Work a Lie?". The New Yorker. ISSN 0028-792X. Retrieved 2023-10-01.
  4. ^ Subbaraman, Nidhi (2023-09-24). "The Band of Debunkers Busting Bad Scientists". Wall Street Journal. Archived from the original on 2023-09-24. Retrieved 2023-10-08.
  5. ^ "APA PsycNet". psycnet.apa.org. Retrieved 2023-10-08.
  6. ^ Young, S. S.; Karr, A. (2011). "Deming, data and observational studies" (PDF). Significance. 8 (3): 116–120. doi:10.1111/j.1740-9713.2011.00506.x.

