Oversampling and undersampling in data analysis

Within statistics, oversampling and undersampling are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes or categories represented). The terms are used in statistical sampling, in survey design methodology, and in machine learning.

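Oversampling and undersampling are opposite and roughly equivalent techniques: oversampling adds instances of the under-represented class, while undersampling removes instances of the over-represented class. There are also more complex oversampling techniques, including the creation of artificial data points with algorithms such as the synthetic minority oversampling technique (SMOTE).^[1]^[2]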
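The difference between these approaches can be illustrated with a minimal Python sketch that uses only the standard library. The function names (`random_oversample`, `random_undersample`, `interpolate_synthetic`) and the parameters `k` and `seed` are hypothetical and chosen for illustration; they are not the API of imbalanced-learn or smote_variants. The last function is a simplified SMOTE-style interpolation under those assumptions, not a full implementation of the published algorithm.

```python
import math
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of the larger classes so that every class is
    as small as the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


def interpolate_synthetic(minority_rows, n_new, k=2, seed=0):
    """SMOTE-style sketch: create each artificial point on the line segment
    between a random minority row and one of its k nearest minority
    neighbours (Euclidean distance)."""
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority_rows))
        base = minority_rows[base_idx]
        others = [r for i, r in enumerate(minority_rows) if i != base_idx]
        neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    # Toy data set: 7 rows of class "a" and 3 rows of class "b".
    rows = [(float(i), float(i * 2)) for i in range(10)]
    labels = ["a"] * 7 + ["b"] * 3

    _, over_labels = random_oversample(rows, labels)
    _, under_labels = random_undersample(rows, labels)
    print(Counter(over_labels))
    print(Counter(under_labels))
    print(interpolate_synthetic(rows[7:], n_new=2))
```

On the toy data, random oversampling yields 7 rows per class and random undersampling yields 3 rows per class. The interpolation function returns new points rather than copies, which is the main distinction between synthetic oversampling and plain duplication.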

[1] "Scikit-learn-contrib/Imbalanced-learn". GitHub. 25 October 2021.
[2] "Analyticalmindsltd/Smote_variants". GitHub. 26 October 2021.
