This thesis was written as part of the Master's program in Computer Science at Florida Atlantic University (FAU). As a graduate researcher, I worked at the Empirical Software Engineering Laboratory (ESEL) under the supervision of Dr. Khoshgoftaar. My research focused on data mining and its application to predicting the quality of software programs. More precisely, I developed new filtering algorithms that combine learners in order to reduce the level of noise in software metrics repositories. This work has led to publications in venues such as Advances in Computer Science and IEEE Proceedings.
This thesis presents two new noise filtering techniques that improve the quality of training datasets by removing noisy instances. The training dataset is first split into subsets, and base learners are induced on each subset. Their predictions are then combined so that an instance is identified as noisy if it is misclassified by a specified number of base learners. The Multiple-Partitioning Filter combines several classifiers on each split. The Iterative-Partitioning Filter uses only one base learner, but goes through multiple iterations. The amount of noise removed is varied by tuning the filtering level or the number of iterations. Empirical studies on a high-assurance software project compare the effectiveness of our noise removal approaches with that of two other filters, the Cross-Validation Filter and the Ensemble Filter. Our studies suggest that using several base classifiers, as well as performing several iterations with a conservative scheme, may improve the efficiency of the filter.
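The general idea of the two filters can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this thesis: the interleaved splitting rule, the two toy learners (`centroid_learner`, `knn_learner`), the stopping rule of the iterative variant, and all function and parameter names are assumptions chosen only to make the example self-contained.

```python
from collections import Counter
import math


def centroid_learner(X, y):
    """Hypothetical base learner: nearest class centroid. Returns a predict function."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
    return lambda x: min(centroids, key=lambda c: math.dist(x, centroids[c]))


def knn_learner(X, y, k=3):
    """Hypothetical base learner: majority vote among the k nearest training instances."""
    def predict(x):
        nearest = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:k]
        return Counter(y[i] for i in nearest).most_common(1)[0][0]
    return predict


def partition_filter(X, y, learners, n_splits, filtering_level):
    """Return indices of instances misclassified by at least `filtering_level` models.

    One model is induced per (split, learner) pair, so several learners
    give a multiple-partitioning style filter.
    """
    models = []
    for s in range(n_splits):
        idx = range(s, len(X), n_splits)  # interleaved split (an assumption)
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        models.extend(fit(Xs, ys) for fit in learners)
    errors = [sum(m(xi) != yi for m in models) for xi, yi in zip(X, y)]
    return [i for i, e in enumerate(errors) if e >= filtering_level]


def iterative_filter(X, y, learner, n_splits, filtering_level, max_iter=5):
    """Repeat single-learner filtering until nothing is flagged; return kept indices."""
    keep = list(range(len(X)))
    for _ in range(max_iter):
        flagged = set(partition_filter([X[i] for i in keep], [y[i] for i in keep],
                                       [learner], n_splits, filtering_level))
        if not flagged:
            break
        keep = [i for pos, i in enumerate(keep) if pos not in flagged]
    return keep


# Toy data: two well-separated clusters; instance 8 carries a flipped label.
X = [(0, 0), (10, 10), (1, 0), (10, 11), (0, 1), (11, 10),
     (1, 1), (11, 11), (0.5, 0.5), (10.5, 10.5), (0.2, 0.8), (10.2, 10.8)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

both = [centroid_learner, knn_learner]            # 3 splits x 2 learners = 6 models
conservative = partition_filter(X, y, both, 3, 4)  # flags only instance 8
aggressive = partition_filter(X, y, both, 3, 1)    # also flags clean instances
kept = iterative_filter(X, y, centroid_learner, 3, 2)
```

On this toy data, a higher filtering level removes only the mislabeled instance, while a level of 1 also discards clean instances that a single model trained on the noisy split gets wrong. This is the trade-off that tuning the filtering level controls.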