Abstract
This paper investigates a novel undersampling technique based on optimal transport for managing imbalanced datasets in classification tasks. Undersampling is crucial for reducing dataset size while preserving essential statistical properties, thereby improving both classification performance and computational efficiency. Existing methods, such as random undersampling, NearMiss, Tomek Links, and Edited Nearest Neighbor, often fail to preserve the underlying data distribution. To address this, we propose a Wasserstein distance-based undersampling method that formulates an optimization problem to minimize distributional distortion. By using the Wasserstein distance to measure dissimilarity between probability distributions, the proposed approach ensures that the reduced dataset retains key structural information. Additionally, we analyze the method's computational complexity and its impact on geodesic structures in Wasserstein space, highlighting its theoretical advantages over conventional techniques. Simulation results on synthetically generated imbalanced datasets demonstrate that the proposed method preserves the statistical structure of the original data more effectively than existing resampling techniques and achieves balanced classification performance for both the majority and minority classes, offering an effective and scalable solution to class imbalance in classification problems.
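The paper's own formulation and implementation are not reproduced here. As a rough illustration of the core idea only, the sketch below greedily selects a majority-class subset whose empirical distribution stays close, in exact optimal-transport cost with squared-Euclidean ground cost (i.e., squared 2-Wasserstein distance between the uniform empirical measures), to that of the full majority class. It assumes the open-source POT library (`ot`), and `wasserstein_undersample` is a hypothetical helper name chosen for illustration, not the authors' API.

```python
# A minimal sketch of Wasserstein-based undersampling (not the authors' code).
# Assumes the POT library (pip install pot); wasserstein_undersample is a
# hypothetical helper name chosen for illustration.
import numpy as np
import ot  # Python Optimal Transport


def wasserstein_undersample(X_maj, k, seed=0):
    """Greedily pick k rows of X_maj whose empirical distribution stays
    close (in exact OT cost with squared-Euclidean ground cost) to the
    empirical distribution of the full majority class."""
    rng = np.random.default_rng(seed)
    n = len(X_maj)
    a = np.full(n, 1.0 / n)            # uniform weights on the full set
    selected = [int(rng.integers(n))]  # seed the subset with one random point
    remaining = set(range(n)) - set(selected)

    while len(selected) < k:
        best_j, best_cost = None, np.inf
        for j in remaining:            # try adding each candidate point
            idx = selected + [j]
            b = np.full(len(idx), 1.0 / len(idx))  # uniform weights on subset
            M = ot.dist(X_maj, X_maj[idx])         # squared-Euclidean cost matrix
            cost = ot.emd2(a, b, M)                # exact OT cost for this subset
            if cost < best_cost:
                best_j, best_cost = j, cost
        selected.append(best_j)
        remaining.remove(best_j)
    return X_maj[np.sort(selected)]
```

In use, only the majority class would be undersampled while the minority class is kept intact, e.g. `X_bal = np.vstack([wasserstein_undersample(X_maj, len(X_min)), X_min])`. Note that this brute-force greedy loop solves one exact OT problem per candidate point and is only practical for small datasets; the paper's optimization formulation and complexity analysis are not reproduced by this illustration.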
Publication Info
- Year: 2025
- Type: article
- Pages: 1-13
- Citations: 0
- Access: Closed
Identifiers
- DOI: 10.1115/1.4070589