1 Introduction
In recent years, the motivation behind applying feature selection techniques has evolved significantly. What was once merely an illustrative example has now become a crucial prerequisite for effective model building. This shift in emphasis can be attributed to several factors, including improved generalization performance, reduced running time requirements, and the need to address constraints and interpretational challenges inherent in the problem domain.
Feature selection is a vital dimensionality reduction technique in data mining, involving the selection of a subset of original features based on specific criteria.
This process is important and commonly utilized to enhance the efficiency and effectiveness of data analysis tasks [1, 2, 3].
It reduces the number of features and removes irrelevant, redundant, or noisy data, which brings immediate benefits to applications: it speeds up the data mining algorithm and improves mining performance, such as predictive accuracy and result comprehensibility.
Therefore, it is essential to employ an effective feature selection method that considers the number of features used for sample classification to enhance processing speed, predictive accuracy, and comprehensibility.
The correlation between features significantly impacts classification outcomes. Removing important features can reduce classification accuracy and negatively affect the quality of SVM models.
Conversely, certain features may have no discernible effect or may carry high levels of noise [4]; removing them increases the classification accuracy rate.
The aim of feature selection is therefore to find the smallest feature subset that yields the highest classification accuracy rate.
The optimal feature subset is not unique: the same accuracy rate may be achieved with different sets of features, because if two features are correlated, one can replace the other.
Note that feature subset selection chooses a set of features from existing features, and does not construct new ones; there is no feature extraction or construction [5, 6].
In this study, we analyze and evaluate the performance of twenty feature selection techniques using two criteria: stability and the classification accuracy rate computed with SVM-SMO.
The experiments are conducted on four datasets obtained from the UCI machine learning repository.
The paper is organized as follows. In Section 2, we give an overview of SVM.
In Section 3, we discuss the different feature selection techniques. In Section 4, we describe the stability criteria used in the literature.
Section 5 describes the results obtained by the approaches. Finally, concluding remarks are made in Section 6.
2 Overview of Support Vector Machine
SVM can be briefly described as follows [7, 8, 9]. Consider a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$. The classifier is obtained by solving

$$\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0.$$

This is the dual form of the quadratic problem, where $K(\cdot, \cdot)$ is the kernel function and $C$ represents the regularization parameter. To solve this optimization problem, quadratic optimization algorithms are utilized; commonly used algorithms include Sequential Minimal Optimization [10, 11], Trust Region, etc. Solving the optimization problem yields the Lagrange multipliers $\alpha_i$, and the optimal hyperplane is given by:

$$f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right),$$

where $b$ is the bias term and the training points with $\alpha_i > 0$ are the support vectors.
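As an illustration only (not the exact experimental setup of this paper), the following Python sketch fits a soft-margin SVM with scikit-learn's SVC, whose libsvm backend solves the dual problem above with an SMO-type algorithm; the WDBC copy bundled with scikit-learn stands in for the datasets used later.

```python
# Minimal sketch (illustrative, not the paper's exact setup): fit a soft-margin SVM.
# scikit-learn's SVC wraps libsvm, which solves the dual problem with an SMO-type
# decomposition algorithm.
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # WDBC data bundled with scikit-learn

clf = SVC(C=1.0, kernel="linear")            # C is the regularization parameter of the dual
clf.fit(X, y)

# The solver returns the non-zero Lagrange multipliers (stored as y_i * alpha_i) and
# the corresponding support vectors, which together define the optimal hyperplane.
print("support vectors:", clf.support_vectors_.shape[0])
print("bias term b:", clf.intercept_[0])
```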
3 Feature Selection Algorithm
Feature selection is a domain garnering growing attention within the realm of machine learning. Numerous feature selection techniques have been outlined in the literature dating back to the 1970s.
Feature selection algorithms are categorized into three main types based on their strategies: filter, wrapper, and embedded models.
Filter feature selection methods do not consider classifier properties; instead, they conduct statistical tests on variables. In contrast, wrapper feature selection evaluates various feature sets by constructing classifiers.
Embedded model algorithms integrate variable selection into the training process, deriving feature relevance analytically from the learning model's objective. Table 1 summarizes some feature selection criteria and algorithms.
Methods | Full Name |
MRMR | Max-Relevance Min-Redundancy [20,18] |
CMIM | Conditional Mutual Info Maximisation [13,18] |
JMI | Joint Mutual Information [14,18] |
DISR | Double Input Symmetrical Relevance [15,18] |
CIFE | Conditional Infomax Feature Extraction [16,18] |
ICAP | Interaction Capping [17,18] |
CONDRED | Conditional Redundancy [18] |
BETAGAMMA | BetaGamma [18] |
MIFS | Mutual Information Feature Selection [19,18] |
CMI | Conditional Mutual Information [18] |
MIM | Mutual Information Maximisation [12,18] |
RELIEF | Relief [18] |
FCBF | Fast Correlation Based Filter [21,27] |
MRF | Markov Random Fields [26] |
SPEC | Spectral [22,27] |
T-TEST | Student’s T-test [27] |
KRUSKAL-WALLIS | Kruskal-Wallis Test [23,27] |
FISHER | Fisher Score [24,27] |
GINI | Gini Index [25,27] |
GA | Genetic Algorithm |
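As an illustration of the filter strategy described above, the sketch below ranks features by their mutual information with the class label, in the spirit of the MIM criterion in Table 1; it is a simplified stand-in for the cited implementations, and the value of k is an arbitrary choice.

```python
# Sketch of a filter-style criterion: rank features by mutual information with the
# class label (MIM-style), independently of any classifier. Simplified illustration;
# k is an arbitrary choice for the example.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
k = 5                                               # number of features to keep

mi = mutual_info_classif(X, y, random_state=0)      # relevance score per feature
ranking = np.argsort(mi)[::-1]                      # best features first
selected = ranking[:k]
print("selected feature indices:", selected)
```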
4 Stability of Feature Selection Algorithm
The stability of a feature selection algorithm refers to how sensitive its feature preferences or rankings are to changes in the training data. It quantifies how different training sets affect the feature preferences [31]. To calculate stability, we require a similarity measure between feature preferences. Consider two feature subsets A and B; we denote:
| . | the cardinality,
∩ the intersection,
∪ the union.
4.1 SS Stability
Kalousis et al. [29] define the similarity index between two subsets, A and B, as:

$$S_S(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

The SS stability is a simple adaptation of the Tanimoto distance, which measures the similarity between the two sets A and B. SS takes values in [0, 1], with 0 meaning that there is no overlap between the two sets, and 1 that the two sets are identical.
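In code, assuming the two subsets are given as sets of feature indices, this index can be computed as follows:

```python
def ss_similarity(a, b):
    """Tanimoto-based similarity S_S(A, B) = |A ∩ B| / |A ∪ B| between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                      # convention: two empty subsets are identical
    return len(a & b) / len(a | b)

# Example: 3 shared features out of 5 distinct ones overall -> 0.6
print(ss_similarity({0, 2, 5, 7}, {0, 2, 5, 9}))   # 3 / 5 = 0.6
```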
4.2 SH Stability
Dunne et al. [30] calculate the similarity between two subsets by comparing the relative Hamming distance of their corresponding masks. In set notation, this measure can be written as:

$$S_H(A, B) = 1 - \frac{|A \setminus B| + |B \setminus A|}{m},$$

where $m$ is the total number of features.
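A corresponding sketch, assuming the subsets are sets of feature indices and m is the total number of features:

```python
def sh_similarity(a, b, m):
    """Relative-Hamming similarity S_H(A, B) = 1 - (|A \\ B| + |B \\ A|) / m."""
    a, b = set(a), set(b)
    return 1.0 - (len(a - b) + len(b - a)) / m

# Example with m = 10 features: the two masks disagree on 2 positions -> 0.8
print(sh_similarity({0, 2, 5, 7}, {0, 2, 5, 9}, m=10))   # 1 - 2/10 = 0.8
```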
4.3 Kuncheva Stability
Kuncheva [32] defines the consistency index for two subsets with the same cardinality as:

$$I_C(A, B) = \frac{r m - k^2}{k (m - k)},$$

where $m$ is the total number of features, $k = |A| = |B|$, and $r = |A \cap B|$. The index equals 1 when the two subsets are identical and corrects for the overlap expected by chance.
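And a sketch of the consistency index, again for subsets of equal cardinality k drawn from m features:

```python
def kuncheva_index(a, b, m):
    """Kuncheva consistency index I_C(A, B) = (r*m - k^2) / (k*(m - k))."""
    a, b = set(a), set(b)
    k = len(a)
    assert len(b) == k and 0 < k < m, "subsets must have the same size k, with 0 < k < m"
    r = len(a & b)
    return (r * m - k * k) / (k * (m - k))

# Example: k = 4, m = 10, r = 3 -> (30 - 16) / 24 ≈ 0.58
print(kuncheva_index({0, 2, 5, 7}, {0, 2, 5, 9}, m=10))
```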
5 Experimental Results
In this section, we define a comparison protocol for the feature selection techniques described in the literature and report the performance of each technique.
The experiments are analyzed using the following performance measures: the classification accuracy rate, computed with the support vector machine, and the stability criteria (Kuncheva stability, information stability, SS stability, and SH stability). Table 2 presents a summary of the four datasets used in the feature selection experiments:
Datasets | Number of classes | Number of instances | Number of features |
Breast Cancer | 2 | 699 | 9 |
Cardiotocography | 2 | 1831 | 21 |
ILPD | 2 | 583 | 9 |
Mammographic Mass | 2 | 961 | 5 |
The four datasets are WDBC (Wisconsin Dataset Breast Cancer), Cardiotocography, ILPD (Indian Liver Patient Dataset), and Mammographic Mass. Evaluating the feature selection techniques requires defining training and testing sets.
In this study, we randomly split each dataset using the hold-out method, a simple form of cross-validation: 60% of the instances are designated for training, while the remaining 40% are reserved for testing.
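A minimal sketch of this hold-out split with scikit-learn (the bundled WDBC data and the random seed are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold-out split: 60% of the instances for training, the remaining 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
print(len(y_train), "training /", len(y_test), "testing instances")
```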
Table 3 outlines the number of instances utilized during both the training and testing phases for each dataset. To compare the feature selection criteria defined above, we proceed as follows: for each data set, we select different training set and we take a set of features for each training set by using each feature selection criterion.
Datasets | Missing instances | Training set | Testing set |
Breast Cancer | 16 | 411 | 272 |
Cardiotocography | 0 | 1101 | 730 |
ILPD | 0 | 351 | 232 |
Mammographic Mass | 131 | 500 | 330 |
Figures 2, 3, 4, and 5 show the Kuncheva stability, information stability, SS stability, and SH stability measures over the four datasets for each feature selection criterion. For each dataset, we calculate the stability over different training sets obtained with the hold-out method, which selects a training set at random.
Ten training sets are drawn for each dataset in order to better exploit the data. The results show that, over all the randomly selected training sets of each dataset, all the methods are stable except GA, CMI, T-test, Fisher, Gini, and Relief.
The stability of JMI, MRMR, DISR, CONDRED, MIFS, FCBF, MRF, and Kruskal-Wallis is equal to 1 for all the datasets, which means that these methods selected the same subset of features for every training set of the four datasets. Therefore, these feature selection criteria selected the relevant subset of features.
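The protocol used here can be sketched as follows: draw several random training sets, run a selector on each, and average a pairwise similarity index (here the SS index) over all pairs of selected subsets. The mutual-information ranking used as the selector is only an assumption for the example.

```python
# Sketch of the stability protocol: select features on several random training sets
# and average a pairwise similarity index over all pairs of selected subsets.
# The mutual-information ranking used as the selector here is only an example.
from itertools import combinations
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
k, n_runs = 5, 10
subsets = []

for seed in range(n_runs):
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.4, random_state=seed)
    mi = mutual_info_classif(X_tr, y_tr, random_state=seed)
    subsets.append(set(np.argsort(mi)[::-1][:k]))    # top-k features of this run

# Average SS similarity (Section 4.1) over all pairs of runs.
pairs = list(combinations(subsets, 2))
stability = np.mean([len(a & b) / len(a | b) for a, b in pairs])
print("average SS stability over", n_runs, "training sets:", round(stability, 3))
```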
5.1 Comparison and Discussion
Figures 2, 3, 4, and 5 show the stability criteria for each feature selection technique over the four datasets. The results show that MRMR, JMI, DISR, CIFE, ICAP, CONDRED, Kruskal-Wallis, and MRF have a stability value close to 1.
This means that these feature selection criteria selected the same feature subset for all the training sets of each dataset. Figures 6, 7, 8, and 9 illustrate the stability criteria versus the number of features.
The analysis of the results indicates that the Relief, GA, and Fisher methods exhibit lower stability (measured by the Kuncheva, information, SS, and SH metrics) across all datasets. We therefore conclude that these techniques are unstable compared with MRMR, JMI, DISR, CIFE, ICAP, CONDRED, Kruskal-Wallis, and MRF, whose average stability is close to 1.
The classification accuracy rate represents an important term to evaluate the performance of feature selection techniques.
Figure 10 describes the classification accuracy rate for each number of features obtained with each feature selection criterion.
In terms of classification accuracy rate, the SPEC and MRF methods clearly provide the lowest accuracy. The highest accuracy for the WDBC dataset is reached by the JMI and MIM methods with 5 features.
For the ILPD dataset, the highest accuracy is obtained by the CMIM and CIFE methods with 6 features. On the Cardiotocography dataset, the highest classification accuracy rate is achieved with the Fisher score using 13 features. For the Mammographic Mass dataset, the highest accuracy is obtained by the CMIM and JMI methods with 2 features.
Figure 11 illustrates the classification accuracy rate obtained with the four best features selected by these methods in each test. We use the hold-out method to generate 20 training sets for each dataset and compute the classification accuracy rate of each training set using the four best features (a sketch of this protocol follows the table below).
The interpretations differ: each feature selection method is suited to a particular dataset. We compute the average classification accuracy rate obtained in each test and summarize the results in the following table.
Average classification accuracy rate (%)
Methods | WDBC | ILPD | Cardio | Mammo |
MRMR | 97.11 | 55.15 | 88.68 | 83.27 |
CMIM | 97.07 | 58.24 | 95.21 | 82.98 |
JMI | 97.07 | 58.22 | 95.21 | 82.98 |
DISR | 96.43 | 58.22 | 97.00 | 82.60 |
CIFE | 96.89 | 58.22 | 96.55 | 82.98 |
ICAP | 96.89 | 58.24 | 95.21 | 82.98 |
CONDRED | 95.80 | 58.24 | 96.55 | 82.98 |
BETAGAMMA | 96.36 | 58.24 | 96.55 | 83.67 |
MIFS | 96.89 | 55.15 | 88.68 | 83.27 |
CMI | 97.07 | 58.26 | 87.00 | 82.98 |
MIM | 96.76 | 58.24 | 95.21 | 83.07 |
RELIEF | 95.95 | 60.80 | 97.50 | 79.53 |
FCBF | 96.15 | 55.64 | 78.28 | 83.27 |
MRF | 95.56 | 58.20 | 78.28 | 78.77 |
SPEC | 95.79 | 57.53 | 80.33 | 79.53 |
T-TEST | 93.25 | 59.95 | 88.56 | 79.53 |
KRUSKAL-WALLIS | 95.99 | 56.73 | 97.85 | 82.98 |
FISHER | 96.76 | 59.95 | 96.82 | 79.53 |
GINI | 96.76 | 58.83 | 96.95 | 83.07 |
GA | 96.32 | 57.05 | 95.96 | 83.07 |
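The protocol behind this table can be sketched as follows; the mutual-information ranking used to pick the four features and the linear kernel are illustrative assumptions, not the exact configuration used in the experiments.

```python
# Sketch of the accuracy protocol: 20 random 60/40 hold-out splits, select the four
# top-ranked features on each training set, train an SVM, and average the test accuracy.
# The mutual-information ranking and linear kernel are illustrative choices only.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
accuracies = []

for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=seed)
    top4 = np.argsort(mutual_info_classif(X_tr, y_tr, random_state=seed))[::-1][:4]
    clf = SVC(kernel="linear").fit(X_tr[:, top4], y_tr)
    accuracies.append(clf.score(X_te[:, top4], y_te))

print("average accuracy over 20 splits: %.2f%%" % (100 * np.mean(accuracies)))
```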
The goal of feature selection is to achieve a balance between the stability of a criterion and the classification accuracy rate (Gulgezen et al., 2009). For this reason, the experimental protocol takes the average classification accuracy rate over the 20 training sets and plots it against the Kuncheva, information, SS, and SH stability measures. Figures 12, 13, 14, and 15 show the stability criteria versus the mean accuracy rate. The goal is to find the set of feature selection criteria with both the highest classification accuracy rate and the highest stability; this set is called the Pareto-optimal set.
A criterion belonging to the Pareto-optimal set is said to be non-dominated [18]. Hence, it is evident from each subplot of Figures 12, 13, 14, and 15 that feature selection techniques positioned towards the top right of the space dominate those towards the bottom left. Given this observation, there is no justification for selecting techniques located at the bottom left [18].
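A small sketch of how the non-dominated (Pareto-optimal) methods can be identified from (stability, accuracy) pairs; the scores below are placeholders, not the measurements reported above.

```python
# Identify the non-dominated methods: a method is dominated if another method is at
# least as good on both axes (stability and accuracy) and strictly better on one.
# The scores below are placeholder values, not the measurements reported above.
scores = {
    "METHOD_A": (1.00, 0.95),   # (stability, accuracy)
    "METHOD_B": (0.60, 0.97),
    "METHOD_C": (0.55, 0.90),   # dominated by METHOD_A and METHOD_B
}

def dominates(p, q):
    return p[0] >= q[0] and p[1] >= q[1] and p != q

pareto = [m for m, s in scores.items()
          if not any(dominates(t, s) for t in scores.values())]
print("Pareto-optimal set:", pareto)
```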
6 Conclusion
This paper introduces a comparison protocol evaluating twenty feature selection techniques across four datasets sourced from the UCI machine learning repository. The experimentation assesses stability criteria and classification accuracy rates calculated using SVM-SMO. Based on this research, we have concluded that each feature selection method can be tailored to suit specific datasets, considering factors such as the number of features and their distribution in the feature space.
Together, the classification accuracy rate and the stability provide complementary information about the selected features: the best feature selection method is the one that achieves both high accuracy and high stability. As future work, it would be interesting to evaluate the performance of these feature selection techniques on DNA microarray data, where there are many features and comparatively few samples.