Archive - Central European Conference on Information and Intelligent Systems, CECIIS - 2014

An experimental comparison of classification algorithm performances for highly imbalanced datasets
Goran Oreški, Stjepan Oreški

Last modified: 2014-07-14

Abstract


Imbalanced learning data often emerge during knowledge discovery in data and present a significant challenge for data mining methods. In this paper we investigate the influence of class-imbalanced data on artificial intelligence methods, namely neural networks and support vector machines, and on classical classification methods, represented by RIPPER and the Naïve Bayes classifier. The research is conducted on classification problems, and classification quality is measured by accuracy and the area under the ROC curve (AUC). To reduce the negative influence of imbalanced data, the SMOTE oversampling technique is used. All experiments use 30 different data sets obtained from the KEEL (Knowledge Extraction based on Evolutionary Learning) repository; they are conducted on the original datasets and repeated on balanced datasets generated with SMOTE. The results indicate that imbalanced data have a significant negative influence on the AUC of neural networks and support vector machines. The same methods show improved AUC when applied to balanced data but, at the same time, deteriorated classification accuracy. RIPPER results are similar, but the changes are of smaller magnitude, while the Naïve Bayes classifier shows an overall deterioration of results on balanced distributions.
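For readers unfamiliar with the balancing step, the core idea of SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below is illustrative only and is not the authors' implementation; the function name `smote_oversample`, the neighbour count `k`, the seed, and the toy data are all assumptions, since the abstract does not state the settings used.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: a simplified SMOTE-style sketch for numeric features.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k minority samples closest to the chosen base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (assumed): 30 majority points and 4 minority points.
majority = [(float(x), float(y)) for x in range(6) for y in range(5)]
minority = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.0), (11.5, 11.5)]

# Generate enough synthetic samples to match the majority class size.
new_points = smote_oversample(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points
```

Because every synthetic point lies between two existing minority samples, the balanced set stays inside the region the minority class already occupies. In practice a maintained library implementation would normally be preferred over a hand-written one.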