Neighbourhood-based undersampling approach for handling imbalanced and overlapped data.

Vuttipittayamongkol, Pattaramon; Elyan, Eyad

doi:10.1016/j.ins.2019.08.062

Authors

Pattaramon Vuttipittayamongkol

Professor Eyad Elyan e.elyan@rgu.ac.uk

Abstract

Class-imbalanced datasets are common across domains including health, security and banking. A typical supervised learning algorithm tends to be biased towards the majority class when dealing with imbalanced datasets. The learning task becomes more challenging when instances from different classes also overlap. In this paper, we propose an undersampling framework for handling class imbalance in binary datasets by removing potentially overlapped data points. Our methods are designed to identify and eliminate majority class instances from the overlapping region. Accurate identification and elimination of these instances maximises the visibility of the minority class instances while minimising excessive elimination of data, which reduces information loss. We propose four methods based on neighbourhood searching, each using a different criterion to identify potentially overlapped instances. Extensive experiments were carried out using simulated and real-world datasets. Results show performance comparable with state-of-the-art methods across common metrics, with exceptional and statistically significant improvements in sensitivity.
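
The general idea of neighbourhood-based undersampling can be illustrated with a minimal sketch. The abstract does not spell out the four criteria, so the rule used here (drop a majority instance if any of its k nearest neighbours belongs to the minority class) is an assumption chosen for illustration and is not necessarily one of the paper's methods. The function name `undersample_overlap`, its parameters and the toy data are all hypothetical.

```python
import math


def undersample_overlap(X, y, k=2, majority=0, minority=1):
    """Drop majority points that have a minority point among their k nearest neighbours."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            keep.append(i)  # minority instances are always retained
            continue
        # Indices of the k nearest neighbours of xi, excluding xi itself.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        # Assumed criterion (illustrative only): any minority neighbour marks
        # this majority point as lying in the overlapping region.
        if not any(y[j] == minority for j in neighbours):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: four majority points near the origin, one majority point sitting
# next to the two minority points.
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),
     (2.0, 2.0), (2.1, 2.0), (2.0, 2.2)]
y = [0, 0, 0, 0, 0, 1, 1]

X_res, y_res = undersample_overlap(X, y, k=2)
print(len(X), "->", len(X_res), "instances")
```

In this toy example only the majority point next to the minority cluster is removed, while the majority points far from the minority class are kept. That is the trade-off the abstract describes: improving minority visibility without discarding more data than necessary.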

Citation

VUTTIPITTAYAMONGKOL, P. and ELYAN, E. 2020. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Information sciences [online], 509, pages 47-70. Available from: https://doi.org/10.1016/j.ins.2019.08.062

Journal Article Type	Article
Acceptance Date	Aug 26, 2019
Online Publication Date	Sep 3, 2019
Publication Date	Jan 31, 2020
Deposit Date	Sep 9, 2019
Publicly Available Date	Sep 4, 2020
Journal	Information Sciences
Print ISSN	0020-0255
Electronic ISSN	1872-6291
Publisher	Elsevier
Peer Reviewed	Peer Reviewed
Volume	509
Pages	47-70
DOI	https://doi.org/10.1016/j.ins.2019.08.062
Keywords	Imbalanced dataset; Undersampling; k-NN; Class overlap; Classification
Public URL	https://rgu-repository.worktribe.com/output/512732
Contract Date	Sep 9, 2019