Dr. Khoshgoftaar and I present two filtering techniques: the multiple-partitioning filter and the iterative-partitioning filter.
This article was published in a special issue on Data Mining Applications of the International Journal of Computer Applications in Technology (IJCAT), Volume 27, Issue 4, 2006. Details are available on the ACM portal or on the publisher's site.
We present two new noise filtering techniques which improve the quality of training datasets by removing data points that are likely to be noisy. In addition, a new measure called ‘efficiency paired comparison’ is introduced to simplify the comparison between two filters. The filtering techniques are based on the partitioning approach: the training dataset is first split into subsets, and base learners are induced on each of these subsets. The predictions are then combined in such a way that an instance in the training data is identified as noisy if it is misclassified by a certain number of base learners. The first technique, the multiple-partitioning filter, combines several classifiers induced on each subset. The second technique, the iterative-partitioning filter, uses only one base learner but goes through multiple filtering iterations. The amount of noise removed by the techniques is varied by tuning either the filtering level or the number of iterations. Empirical studies using software measurement data from a high-assurance software project assess the efficiencies of our two noise filtering approaches. The empirical results suggest that using several base classifiers, as well as performing several iterations with a conservative filtering scheme, can improve the efficiency of the filtering technique.
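To make the partitioning idea concrete, here is a minimal sketch of how the two filters described in the abstract could be structured. It is not the implementation from the paper: the two base learners (`centroid_learner`, `one_nn_learner`), the round-robin split, the stop-when-nothing-is-removed rule, and the toy one-dimensional `data` are all assumptions made for illustration. Only the overall scheme (split into subsets, induce learners per subset, flag an instance once enough learners misclassify it) follows the description above.

```python
from collections import Counter

def centroid_learner(train):
    """Hypothetical base learner: nearest class centroid on a 1-D feature."""
    sums, counts = Counter(), Counter()
    for x, y in train:
        sums[y] += x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def one_nn_learner(train):
    """Hypothetical base learner: label of the nearest training point."""
    return lambda x: min(train, key=lambda p: abs(x - p[0]))[1]

def partition_filter(data, learners, n_parts, filtering_level):
    """Multiple-partitioning sketch: every learner is induced on every subset.

    An instance is flagged as noisy when at least `filtering_level` of the
    resulting models misclassify it.
    """
    # Assumption: a simple round-robin split stands in for the partitioning.
    subsets = [data[i::n_parts] for i in range(n_parts)]
    models = [make(subset) for subset in subsets for make in learners]
    clean, noisy = [], []
    for x, y in data:
        errors = sum(model(x) != y for model in models)
        (noisy if errors >= filtering_level else clean).append((x, y))
    return clean, noisy

def iterative_filter(data, learner, n_parts, filtering_level, max_iters):
    """Iterative-partitioning sketch: one base learner, repeated filtering."""
    removed = []
    for _ in range(max_iters):
        data, noisy = partition_filter(data, [learner], n_parts, filtering_level)
        if not noisy:  # assumed stopping rule: nothing left to remove
            break
        removed.extend(noisy)
    return data, removed

# Made-up toy data (feature, label); (2.5, 1) is a deliberately mislabeled point.
data = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0), (10.0, 1), (11.0, 1),
        (12.0, 1), (13.0, 1), (2.5, 1), (0.5, 0), (1.5, 0), (11.5, 1)]

# Conservative level (3 of 4 models must disagree) versus aggressive level (1).
_, noisy_conservative = partition_filter(data, [centroid_learner, one_nn_learner], 2, 3)
_, noisy_aggressive = partition_filter(data, [centroid_learner, one_nn_learner], 2, 1)
clean_iter, removed_iter = iterative_filter(data, centroid_learner, 2, 2, 5)
print(noisy_conservative, noisy_aggressive, removed_iter)
```

In this sketch, raising `filtering_level` makes the filter more conservative (fewer instances removed), while lowering it removes more instances at the risk of discarding clean ones. That is the trade-off the abstract refers to when it mentions tuning the filtering level.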