Abstract

This paper investigates a novel undersampling technique based on optimal transport for managing imbalanced datasets in classification tasks. Undersampling is crucial for reducing dataset size while preserving essential statistical properties, thereby improving both classification performance and computational efficiency. Existing methods, such as random undersampling, NearMiss, Tomek Links, and Edited Nearest Neighbor, often fail to optimally preserve data distribution. To address this, we propose a Wasserstein distance-based undersampling method that formulates an optimization problem to minimize distributional distortion. By leveraging the Wasserstein distance to measure dissimilarity between probability distributions, the proposed approach ensures that the reduced dataset retains key structural information. Additionally, we analyze the method's computational complexity and its impact on geodesic structures in Wasserstein space, highlighting its theoretical advantages over conventional techniques. Simulation results on synthetically generated imbalanced datasets demonstrate that the proposed method preserves the statistical structure of the original data more effectively than existing resampling techniques and achieves balanced classification performance for both the majority class and minority class, offering an effective and scalable solution to class imbalance in classification problems.
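To make the idea of distribution-preserving undersampling concrete, the following is a minimal one-dimensional sketch in Python. It is not the authors' optimization problem or algorithm, which the abstract does not specify. The function names `wasserstein_1d` and `quantile_undersample` are hypothetical, and the quantile-midpoint selection rule is an assumed stand-in used only to show how a reduced majority class can be scored by its Wasserstein-1 distance to the original sample.

```python
import bisect
import random


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Both samples carry uniform weights. In one dimension W1 equals the
    area between the two empirical CDFs, which is what is summed here.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    grid = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both CDFs are constant on [left, right), so evaluate them at left.
        cdf_u = bisect.bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect.bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def quantile_undersample(majority, k):
    """Keep k points at evenly spaced quantile midpoints (assumed rule).

    Requires 1 <= k <= len(majority). This is an illustrative heuristic,
    not the method proposed in the paper.
    """
    ordered = sorted(majority)
    n = len(ordered)
    return [ordered[int((i + 0.5) * n / k)] for i in range(k)]


if __name__ == "__main__":
    random.seed(0)
    # Synthetic majority class: 200 draws from a standard normal.
    majority = [random.gauss(0.0, 1.0) for _ in range(200)]

    kept_quantile = quantile_undersample(majority, 20)
    kept_random = random.sample(majority, 20)

    # Lower W1 means the reduced sample distorts the distribution less.
    print("W1, quantile-based:", wasserstein_1d(majority, kept_quantile))
    print("W1, random:        ", wasserstein_1d(majority, kept_random))
```

The sketch only covers the scalar case, where the Wasserstein-1 distance reduces to the area between empirical CDFs. Multivariate data would need a general optimal transport solver, which is outside the scope of this illustration.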

Publication Info

Year: 2025
Type: article
Pages: 1-13
Citations: 0
Access: Closed

Cite This

Sungjun Seo, Mohammad Afrazi, Kooktae Lee (2025). An Optimal Transport-based Undersampling Technique for Handling Imbalanced Datasets. Journal of Dynamic Systems Measurement and Control, 1-13. https://doi.org/10.1115/1.4070589

Identifiers

DOI: 10.1115/1.4070589