1 Introduction
Feature selection has been a very active research field in many applications [16, 14, 17]. Recently, DNA microarray technology has gained attention from biologists and scientists as a way to improve cancer diagnosis [8, 10]. DNA microarray datasets are composed of a large number of gene expression values and only a few dozen instances. This characteristic increases the risk of overfitting in the classification process and significantly reduces the quality of the classification model. To overcome this problem, it is very important to reduce the number of genes by selecting an informative subset of genes and eliminating the irrelevant and redundant ones. This preprocessing phase is called gene selection.
Gene selection, or feature selection, aims to select the smallest subset of genes without reducing the classification accuracy rate. Approaches can be divided into three classes. The first is the filter approach, which evaluates candidate subsets of genes independently of the classifier. The second is the wrapper approach, which uses the classifier to compute the fitness of a gene subset. The last is the embedded approach, which incorporates the gene selection procedure into the classification system.
In this paper, we propose a novel SVM-RFE approach called SVM-RFE-ED that incorporates the energy distance to compute the class separability and to minimize the number of genes. This approach aims to select the smallest subset of genes that provides a high classification accuracy rate.
The performance assessment is conducted on five datasets used for cancer diagnosis. Experimental results indicate that the proposed approach SVM-RFE-ED produces very satisfactory results and a high classification accuracy rate. The stability of the proposed approach is also demonstrated.
The rest of the paper is organized as follows. In Section 2, we present and detail the proposed approach. In Section 3, the results are critically analyzed against existing approaches. Finally, in Section 4, the conclusion and some perspectives are given.
2 The Proposed Approach SVM-RFE-ED
2.1 SVM-RFE Algorithm
SVM-RFE (Support Vector Machine - Recursive Feature Elimination) is an iterative algorithm that ranks the initial genes according to a score function and eliminates the genes with the lowest scores. SVM-RFE was proposed by Guyon et al. [6]; the basic idea is to train an SVM with some kernel function and recursively eliminate the genes with the smallest ranking scores [9].
SVM [2] is one of the most popular kernel-based approaches used to classify data. Mathematically, for a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, the SVM solves:

$$\min_{w, b, \xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \;\; y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \qquad (1)$$

where $w$ is the weight vector, $b$ the bias, $\xi_i$ the slack variables and $C$ the regularization parameter.

The dual problem of (1) is given as follows:

$$\max_{\alpha} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \;\; 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (2)$$

where the $\alpha_i$ are the Lagrange multipliers and $K(\cdot, \cdot)$ is a kernel function (Table 1).
Kernel name | Formulation | Parameters |
Linear | $K(x, y) = x^T y$ | – |
Polynomial | $K(x, y) = (x^T y + 1)^d$ | degree $d$ |
Gaussian | $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ | width $\sigma$ |
Multilayer Perceptron | $K(x, y) = \tanh(\beta_0 \, x^T y + \beta_1)$ | $\beta_0$, $\beta_1$ |
Quadratic | $K(x, y) = (x^T y + 1)^2$ | – |
SVM-RFE uses the coefficient vector $w$ of the decision function to rank the genes: each gene $i$ receives the ranking score $c_i = w_i^2$, and the genes with the smallest scores are removed first.
The general schema of SVM-RFE algorithm can be described as in Algorithm 1.
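A minimal sketch of the recursive elimination loop is given below. To keep the sketch dependency-free, a least-squares linear classifier stands in for the SVM; in the actual algorithm, `w` would be the weight vector of the trained SVM.

```python
import numpy as np

def svm_rfe(X, y, n_keep, step=1):
    """Recursive feature elimination sketch.

    X: (n_samples, n_features) matrix; y: labels in {-1, +1}.
    A least-squares linear classifier stands in for the SVM, so the
    weight vector w below is only a proxy for the SVM's coefficients.
    Returns the kept feature indices and the eliminated ones (worst first).
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        Xs = X[:, remaining]
        # Stand-in training step: w = argmin ||Xs w - y||^2
        w, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        scores = w ** 2                       # ranking criterion c_i = w_i^2
        k = min(step, len(remaining) - n_keep)
        worst = np.argsort(scores)[:k]        # positions of lowest scores
        for j in sorted(worst, reverse=True):
            eliminated.append(remaining.pop(j))
    return remaining, eliminated
```

Eliminating `step > 1` genes per iteration speeds up the procedure on large microarray datasets, at the cost of a coarser ranking.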
Unfortunately, the performance of SVM-RFE becomes unstable for some values of the filter-out factor, i.e., the number of genes eliminated in each iteration [11]. In addition, SVM-RFE finds a gene combination for classification and does not take into consideration the class separability of each gene. In order to overcome this limitation, we propose to improve SVM-RFE by incorporating the energy distance, which measures the discriminative power of each gene.
2.2 Energy Distance
Energy distance is a statistical distance between the distributions of random vectors. The name "energy" comes from Newton's gravitational potential energy, which is based on the distance between two bodies [15].

For two random vectors $X$ and $Y$ in $\mathbb{R}^d$ with $E\|X\| < \infty$ and $E\|Y\| < \infty$, the squared energy distance is defined as:

$$D^2(X, Y) = 2E\|X - Y\| - E\|X - X'\| - E\|Y - Y'\|$$

where $X'$ and $Y'$ are independent and identically distributed copies of $X$ and $Y$, respectively. $D^2(X, Y) \ge 0$, with equality if and only if $X$ and $Y$ have the same distribution.

For multiple random vectors (e.g., more than two classes), the pairwise energy distances can be combined, for instance by summing over all pairs of classes, to obtain an overall measure of class separability.
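The two-sample energy statistic can be sketched as follows for one-dimensional samples, as when measuring the separability of a single gene across two classes (sample means replace the expectations in the definition):

```python
import numpy as np

def energy_distance(x, y):
    """Squared sample energy distance between two 1-D samples:
    D^2 = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||,
    with expectations replaced by pairwise sample means.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - y[None, :]).mean()   # E||X - Y||
    b = np.abs(x[:, None] - x[None, :]).mean()   # E||X - X'||
    c = np.abs(y[:, None] - y[None, :]).mean()   # E||Y - Y'||
    return 2 * a - b - c
```

The statistic is zero when the two samples coincide and grows as the class-conditional distributions separate.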
2.3 SVM-RFE-ED Algorithm
The proposed approach is an enhanced version of the standard SVM-RFE that incorporates the energy distance to compute class separability. SVM-RFE-ED uses a modified ranking score that combines, for each gene $i$, the SVM weight magnitude $w_i^2$ with the energy distance between the class-conditional distributions of that gene, so that genes are ranked both by their contribution to the decision function and by their individual discriminative power.
The algorithm of SVM-RFE-ED is described as follows:
To demonstrate the contribution of the energy distance, we compare SVM-RFE using the energy distance with SVM-RFE using the Hausdorff distance and the Jeffries-Matusita (JM) distance.
The Hausdorff distance was introduced by Nadler in 1978 [12, 13] and measures the dissimilarity between two sets of points: two groups are similar when every point of each group lies close to some point of the other. For two groups $A$ and $B$:

$$H(A, B) = \max\{h(A, B), h(B, A)\}$$

The function $h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\|$ is the directed Hausdorff distance from $A$ to $B$.
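A small NumPy sketch of this definition, for point sets stored one point per row:

```python
import numpy as np

def directed_hausdorff(a, b):
    # h(A, B) = max over a in A of (min over b in B of ||a - b||)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return d.min(axis=1).max()

def hausdorff_distance(a, b):
    """H(A, B) = max(h(A, B), h(B, A)) for 2-D arrays a, b
    with one point per row."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```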
The Jeffries-Matusita (JM) distance is widely used for variable selection [1, 4]. For two classes with means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$ (under a Gaussian assumption):

$$JM = 2\left(1 - e^{-B}\right)$$

where $B$ is the Bhattacharyya distance:

$$B = \frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)} + \frac{1}{2} \ln\left(\frac{\sigma_1^2 + \sigma_2^2}{2\sigma_1\sigma_2}\right)$$

The JM distance is bounded in $[0, 2]$; larger values indicate better class separability.
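A sketch of the univariate JM distance under the Gaussian assumption above:

```python
import numpy as np

def jm_distance(x, y):
    """Jeffries-Matusita distance between two 1-D samples, assuming each
    class is Gaussian: JM = 2 (1 - exp(-B)), with B the Bhattacharyya
    distance between the two fitted Gaussians.
    """
    m1, m2 = np.mean(x), np.mean(y)
    v1, v2 = np.var(x), np.var(y)
    b = (m1 - m2) ** 2 / (4 * (v1 + v2)) \
        + 0.5 * np.log((v1 + v2) / (2 * np.sqrt(v1 * v2)))
    return 2 * (1 - np.exp(-b))
```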
3 Experimental Results
3.1 Datasets
In this section, we present the results obtained by the proposed approach. The experiments are conducted on five datasets widely used to benchmark gene selection approaches, namely, colon cancer, leukemia cancer, lung cancer, ovarian cancer and DLBC cancer. Table 2 presents information about the datasets used in our study.
Dataset name | Number of genes | Number of samples | Number of classes |
Colon | 2000 | 62 | 2 |
DLBC | 4026 | 47 | 2 |
Leukemia | 5147 | 72 | 2 |
Lung | 12533 | 181 | 2 |
Ovarian | 15154 | 253 | 2 |
The first column of Table 2 gives the dataset name; the second column is the number of genes, the third the number of samples and the last the number of classes.
3.2 Parameters Setting
We randomly split the original dataset into separate training and testing sets. Table 3 shows the number of genes and samples used for training and testing phase in each dataset.
The first column of Table 3 gives the dataset name; the second column is the number of samples used for training and the last column the number of samples used for testing.
The algorithm of SVM-RFE-ED is trained by using a kernel function. In this work, we use five kernel functions. Table 4 presents the parameter settings of each kernel function.
Kernel name | Parameters setting |
Linear | |
Polynomial | |
Gaussian | |
Multilayer Perceptron | |
Quadratic | |
The first column of Table 4 gives the name of the kernel function and the second column the values of its parameters. These parameters were chosen experimentally and yielded the best performance in our tests.
The parameter
3.3 Results and Discussions
The performance evaluation of the proposed approach SVM-RFE-ED is conducted in terms of classification accuracy rate, sensitivity and specificity. Tables 5 and 6 show the results obtained by the proposed approach.
Datasets | CAR | Sensitivity | Specificity |
Colon | 95.65 | 0.93 | 1 |
DLBC | 100 | 1 | 1 |
Leukemia | 100 | 1 | 1 |
Lung | 100 | 1 | 1 |
Ovarian | 100 | 1 | 1 |
Datasets | Selected Genes | Kernel |
Colon | 600 | Polynomial |
DLBC | 201 | Multilayer Perceptron |
Leukemia | 257 | Linear, Gaussian, Polynomial, Multilayer Perceptron |
Lung | 626 | Linear, Polynomial, Quadratic |
Ovarian | 757 | Linear, Gaussian, Polynomial, Quadratic |
Table 5 gives the classification accuracy rate (CAR), sensitivity and specificity of our approach for each dataset. As can be seen, the performance of SVM-RFE-ED is very high: the proposed approach reaches a 100% classification accuracy rate and a sensitivity and specificity of 1 for the DLBC, leukemia, lung and ovarian datasets.
Table 6 presents the number of selected genes and the kernel functions that provided the best results. The analysis shows that the proposed approach performs well with respect to the number of selected genes: the number of selected genes is significantly reduced, and for the DLBC, leukemia, lung and ovarian datasets the best accuracy is obtained with these reduced subsets.
The last column of Table 6 lists the kernel functions that provided the best results. For colon cancer, the polynomial kernel performed best. For DLBC, the multilayer perceptron kernel gave the best results. For leukemia, the linear, Gaussian, polynomial and multilayer perceptron kernels all provided a 100% classification accuracy rate.
The results obtained by the proposed approach SVM-RFE-ED are summarized in the following figures. Figures 1, 2, 3, 4 and 5 illustrate the classification accuracy rate obtained by the proposed approach for each dataset over a range of percentages of selected genes.
The proposed approach SVM-RFE-ED uses the energy distance to compute the class separability of each gene. To assess the contribution of the energy distance, we replace it with two other distances, the Hausdorff distance and the Jeffries-Matusita (JM) distance. The results are reported in Table 7.
Datasets | SVM-RFE-ED | SVM-RFE + Hausdorff | SVM-RFE + JM |
Colon | 95.65 | 94.66 | 95.50 |
DLBC | 100 | 100 | 100 |
Leukemia | 100 | 100 | 100 |
Lung | 100 | 99.95 | 100 |
Ovarian | 100 | 99.90 | 99.98 |
The results in Table 7 show that the classification accuracy rates obtained with the Hausdorff and JM distances are nearly identical to those of SVM-RFE-ED, with a small advantage for SVM-RFE-ED on the lung and ovarian datasets.
To validate the performance and the results obtained by the proposed approach SVM-RFE-ED, we compare its classification performance with that of seven gene selection approaches reported in [11]. Table 8 and Figure 6 describe these results.
Table 8 shows the classification accuracy rates obtained by SVM-RFE-ED compared to seven gene selection approaches. The first column of Table 8 gives the name of the gene selection approach; the second and third columns are the classification accuracy rates on the colon and leukemia datasets, respectively.
The analysis of Table 8 demonstrates that the proposed approach SVM-RFE-ED provides satisfactory results and achieves a high classification accuracy rate compared to the other approaches. As can be seen, the classification performance is significantly better on both the colon and leukemia datasets.
In order to validate the results and the performances of the proposed approach SVM-RFE-ED, we must measure its stability. The stability of a feature selection method is defined as its sensitivity to variations in the training set; in other terms, stability measures the robustness of a method when the training set changes [7]. In this study, we compute two stability measures widely used in the literature:

$$S_S(f_i, f_j) = \frac{|f_i \cap f_j|}{|f_i \cup f_j|}$$

$$S_H(f_i, f_j) = 1 - \frac{|f_i \setminus f_j| + |f_j \setminus f_i|}{m}$$

where:
- $f_i$ and $f_j$ are two sets of selected features obtained using different training sets,
- $m$ is the total number of features,
- $|\cdot|$ is the cardinality,
- $\setminus$ is the set-minus.
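As an illustration, the two measures can be computed directly on selected-gene index sets (a minimal sketch assuming a Jaccard-type $S_S$ and a Hamming-type $S_H$, a common pair of stability measures):

```python
def stability_ss(f1, f2):
    """Jaccard-type similarity between two selected-gene sets."""
    f1, f2 = set(f1), set(f2)
    return len(f1 & f2) / len(f1 | f2)

def stability_sh(f1, f2, m):
    """Hamming-type similarity; m is the total number of genes."""
    f1, f2 = set(f1), set(f2)
    return 1 - (len(f1 - f2) + len(f2 - f1)) / m
```

In practice, the pairwise similarities are averaged over all pairs of runs to obtain a single stability value per dataset.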
The values of $S_S$ and $S_H$ lie in $[0, 1]$; a value close to 1 indicates a stable method.
In this study, we run the proposed approach SVM-RFE-ED several times, each time with a different random training/test split. Figures 7 and 8 show the $S_S$ and $S_H$ stability computed for each dataset from the gene subsets selected across these runs.
4 Conclusion
In this paper, we address the problem of cancer diagnosis by solving the gene selection problem. We propose a novel SVM-RFE variant based on the energy distance. The proposed approach, called SVM-RFE-ED, combines the weight vector provided by the SVM with the energy distance to measure the class separability of each gene. The performance evaluation has been conducted on five widely used cancer-diagnosis datasets: colon, DLBC, leukemia, lung and ovarian.
Through the obtained results, we have clearly observed that SVM-RFE-ED provides very good results while significantly reducing the number of genes. In addition, the stability of SVM-RFE-ED has been demonstrated. In future work, we will consider the problem of gene redundancy and incorporate it into SVM-RFE-ED.