Abstract
This paper investigates a novel undersampling technique based on optimal transport for managing imbalanced datasets in classification tasks. Undersampling is crucial for reducing dataset size while preserving essential statistical properties, thereby improving both classification performance and computational efficiency. Existing methods, such as random undersampling, NearMiss, Tomek Links, and Edited Nearest Neighbor, often fail to preserve the underlying data distribution. To address this, we propose a Wasserstein distance-based undersampling method that formulates an optimization problem to minimize distributional distortion. By using the Wasserstein distance to measure dissimilarity between probability distributions, the proposed approach ensures that the reduced dataset retains key structural information. Additionally, we analyze the method's computational complexity and its impact on geodesic structures in Wasserstein space, highlighting its theoretical advantages over conventional techniques. Simulation results on synthetically generated imbalanced datasets demonstrate that the proposed method preserves the statistical structure of the original data more effectively than existing resampling techniques and achieves balanced classification performance for both the majority and minority classes, offering an effective and scalable solution to class imbalance in classification problems.
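The paper's own formulation and implementation are not reproduced here. As a rough illustration of the core idea only, the sketch below greedily selects a majority-class subset whose empirical distribution stays close, in exact optimal-transport cost with squared-Euclidean ground cost (i.e., squared 2-Wasserstein distance between the uniform empirical measures), to that of the full majority class. It assumes the open-source POT library (`ot`), and `wasserstein_undersample` is a hypothetical helper name chosen for illustration, not the authors' API.

```python
# A minimal sketch of Wasserstein-based undersampling (not the authors' code).
# Assumes the POT library (pip install pot); wasserstein_undersample is a
# hypothetical helper name chosen for illustration.
import numpy as np
import ot  # Python Optimal Transport


def wasserstein_undersample(X_maj, k, seed=0):
    """Greedily pick k rows of X_maj whose empirical distribution stays
    close (in exact OT cost with squared-Euclidean ground cost) to the
    empirical distribution of the full majority class."""
    rng = np.random.default_rng(seed)
    n = len(X_maj)
    a = np.full(n, 1.0 / n)            # uniform weights on the full set
    selected = [int(rng.integers(n))]  # seed the subset with one random point
    remaining = set(range(n)) - set(selected)

    while len(selected) < k:
        best_j, best_cost = None, np.inf
        for j in remaining:            # try adding each candidate point
            idx = selected + [j]
            b = np.full(len(idx), 1.0 / len(idx))  # uniform weights on subset
            M = ot.dist(X_maj, X_maj[idx])         # squared-Euclidean cost matrix
            cost = ot.emd2(a, b, M)                # exact OT cost for this subset
            if cost < best_cost:
                best_j, best_cost = j, cost
        selected.append(best_j)
        remaining.remove(best_j)
    return X_maj[np.sort(selected)]
```

In use, only the majority class would be undersampled while the minority class is kept intact, e.g. `X_bal = np.vstack([wasserstein_undersample(X_maj, len(X_min)), X_min])`. Note that this brute-force greedy loop solves one exact OT problem per candidate point and is only practical for small datasets; the paper's optimization formulation and complexity analysis are not reproduced by this illustration.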
Publication Info
- Year: 2025
- Type: article
- Pages: 1-13
- Citations: 0
- Access: Closed
Identifiers
- DOI: 10.1115/1.4070589