Learning from Imbalanced Data

Haibo He; Edwardo A. Garcia

doi:10.1109/tkde.2008.239

Abstract

With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

Keywords

Computer science; Data science; Raw data; Machine learning; Artificial intelligence; Big data; Field (mathematics); Data mining

Affiliated Institutions

Stevens Institute of Technology, US

Related Publications

Survey on deep learning with class imbalance

Justin Johnson, Taghi M. Khoshgoftaar

The purpose of this study is to examine existing deep learning techniques for addressing class imbalanced data. Effective classification with imbalanced data is an impo...

2019 Journal Of Big Data 2538 citations

Class imbalances versus small disjuncts

Taeho Jo, Nathalie Japkowicz

It is often assumed that class imbalances are responsible for significant losses of performance in standard classifiers. The purpose of this paper is to question whether cla...

2004 ACM SIGKDD Explorations Newsletter 669 citations

SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

Alberto Fernández, Salvador García, Francisco Herrera +1 more

The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to...

2018 Journal of Artificial Intelligence Re... 1895 citations

Improving Performance in Neural Networks Using a Boosting Algorithm

Harris Drucker, Robert E. Schapire, Patrice Simard

A boosting algorithm converts a learning machine with error rate less than 50% to one with an arbitrarily low error rate. However, the algorithm discussed here depends on having...

1992 Neural Information Processing Systems 158 citations

A systematic study of the class imbalance problem in convolutional neural networks

Mateusz Buda, Atsuto Maki, Maciej A. Mazurowski

In this study, we systematically investigate the impact of class imbalance on classification performance of convolutional neural networks (CNNs) and compare frequently used meth...

2018 Neural Networks 2639 citations

Publication Info

Year: 2009
Type: article
Volume: 21
Issue: 9
Pages: 1263-1284
Citations: 8871
Access: Closed

Cite This

APA Style

Haibo He, Edwardo A. Garcia (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. https://doi.org/10.1109/tkde.2008.239
