ModifiedFAST: A New Optimal Feature Subset Selection Algorithm
Authors: Arpita Nagpal, Deepti Gaur
Published: Journal of Information and Communication Convergence Engineering, Vol. 13, Issue 2, pp. 113-122, 30 June 2015

ABSTRACT
Feature subset selection is a preprocessing step in learning algorithms. In this paper, we propose an efficient algorithm, ModifiedFAST, for feature subset selection. This algorithm is suitable for text datasets, and uses the concept of information gain to remove irrelevant and redundant features. A new optimal value of the threshold for symmetric uncertainty, used to identify relevant features, is found. The thresholds used by previous feature selection algorithms such as FAST, Relief, and CFS were not optimal. It has been proven that the threshold value greatly affects the percentage of selected features and the classification accuracy. A new unified performance metric that combines accuracy and the number of selected features has been proposed and applied in the proposed algorithm. It was experimentally shown that the percentage of selected features obtained by the proposed algorithm was lower than that obtained using existing algorithms in most of the datasets. The effectiveness of our algorithm on the optimal threshold was statistically validated against other algorithms.

KEYWORDS
Entropy, Feature selection, Filter model, Graph-based clustering, Mutual information

I. INTRODUCTION
Most real-world data cannot be applied directly to any data mining algorithm because of the high dimensionality of the dataset. In knowledge mining, similar patterns in a large dataset are clustered. The major difficulty posed by an increasing number of features, and correspondingly increasing dimensionality, has been called 'the curse of dimensionality' [1]. Therefore, feature selection is an important preprocessing step before any model is applied to the data. Moreover, as the number of features increases, the complexity grows exponentially.
In order to reduce the dimensionality of the data, feature selection is of paramount importance in most real-world tasks. Feature selection involves removing redundant and irrelevant features so that the remaining features can still represent the complete information. Sometimes, additional features make no contribution to the performance of the classification task. Reducing the dimensionality of a dataset provides insight into the data even before a data mining process is applied.
A subset of the features is derived using the values of the features, as calculated by mathematical criteria that provide information about the contribution of each feature to the dataset. Various evaluation criteria exist, according to which feature selection can be broadly classified into four categories: the filter approach, the wrapper approach, the embedded approach, and the hybrid approach [2-5].
In filter feature selection, the feature subset is selected as a preprocessing step before the application of any learning and classification processes. Thus, feature selection is independent of the learning algorithm that will be applied. This is known as the filter approach because the features are filtered out of the dataset before any algorithms are applied. This method is usually faster and computationally more efficient than the wrapper approach [2].
In the wrapper method [6], as in simulated annealing or genetic algorithms, features are selected in accordance with the learning algorithm which will then be applied. This approach yields a better feature subset than the filter approach, because it is tuned according to the algorithm, but it is much slower than the filter approach and has to be rerun if a new algorithm is applied.
The wrapper model requires the use of a predetermined learning algorithm to determine which feature subset is the best. It results in superior learning performance, but tends to be more computationally expensive than the filter model. When the number of features becomes very large, it is preferable to use the filter model due to its computational efficiency [4].
A recently proposed feature selection technique is the ensemble learning-based feature selection method. The purpose of the ensemble classifier is to produce many diverse feature selectors and combine their outputs [7].
This paper presents an algorithm that is simple and fast to execute. The algorithm works in steps. First, irrelevant features are removed using the concept of symmetric uncertainty (SU). A threshold value of SU is established, enabling irrelevant features to be removed. The threshold value employed in this algorithm was tested against various other threshold values, and we found that other threshold values led to the identification of a different number of relevant features. After obtaining a set of relevant features, a minimum spanning tree (MST) is constructed. The redundant features are removed from the MST based on the assumption that the correlation between two features should always be less than their individual correlations with the class. Features that are found to have a greater correlation with the class than with the other features are added to the cluster; other features are ignored. The next section describes the existing algorithms in this field. Section III explains the definitions and provides an outline of the work. Section IV discusses how the algorithm is implemented and analysed. Section V discusses datasets and performance metrics. Results are discussed in Section VI, and the last section contains the conclusion.
II. RELATED WORK
The goal of feature selection is to determine a subset of the features of the original set by removing irrelevant and redundant features.
Liu and Yu [8] have established that the general process of feature selection can be divided into four processes: subset generation, subset evaluation, stopping criterion, and subset validation.
In the subset generation step, the starting point of the search needs to be decided first. A search strategy is then implemented: exhaustive search, heuristic search, or random search. All of the search strategies operate in three directions to generate a feature subset: forward (adding a feature to a selected subset that begins with the empty set), backward (eliminating features from a selected subset that begins with the full original set), and bidirectional (both adding and removing features). The generated subset is then evaluated based on mathematical criteria. These mathematical criteria are divided into the filter, wrapper, hybrid, and embedded approaches. Within the filter model, many algorithms have been proposed, including Relief [9], ReliefF, FOCUS, FOCUS-2 [10], correlation-based feature selection (CFS) [11], the fast correlation-based filter (FCBF) [12], and FAST [13]. Some of these algorithms, including Relief, FCBF, and FAST, employ a threshold to decide whether a feature is relevant.
The Relief algorithm for feature selection uses distance measures as a feature subset evaluation criterion. It weighs (ranks) each feature under different classes. If the score exceeds a specified threshold, then the feature is retained; otherwise it is removed [9].
CFS algorithms are based on concepts from information theory. The FCBF algorithm developed by Yu and Liu [12] removes both irrelevant and redundant features, using SU as a goodness measure. It uses the concept of correlation, based on the information-theoretic concepts of mutual information and entropy, to calculate the uncertainty of a random variable.
Hall and Smith [3] have developed a CFS algorithm for feature selection that uses correlation to evaluate the value of features. This approach is based on the idea that good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other [11]. In this algorithm, it is necessary to find the best subset from the 2^n possible subsets of the n features. The subset generation strategy used is best-first search. In this strategy, the feature-class and feature-feature correlation matrices are given as input. Initially, the feature set is empty; as each feature is added, its value is evaluated. The feature set with the highest evaluation is chosen. Subsequently, the next features are added to the best subset and evaluated. Finally, the best subset found is returned [11]. This approach uses the Pearson correlation as a heuristic for calculating the value of each feature subset that is generated. The search terminates when the best subset is found: it stops when five consecutive fully expanded, non-improving subsets have been encountered.
The subset generation and evaluation process terminates when it reaches the stopping criterion. Different stopping criteria are used in different algorithms; some involve a specified limit on the number of features or a threshold value. Finally, validation data are used to validate the selected subset. In real-world applications, we do not have any prior knowledge of the relevance of a feature set. Therefore, observations are made of how changes in the feature set affect the performance of the algorithm. For example, the classification error rate or the proportion of selected features can be used as a performance indicator for a selected feature subset [8].
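The best-first strategy with the five-non-improving-expansions stopping rule described above can be sketched as a greedy forward search. This is an illustrative simplification, not Hall and Smith's exact implementation; the `merit` function below is an assumed placeholder for a subset-scoring heuristic such as the CFS merit:

```python
def best_first_search(features, merit, max_stale=5):
    """Greedy forward best-first search (sketch).

    merit(subset) -> float scores a candidate feature subset; the
    search stops after max_stale consecutive non-improving expansions.
    """
    best_subset = frozenset()
    best_score = merit(best_subset)
    current = best_subset
    stale = 0
    while stale < max_stale:
        # expand the current subset by every feature not yet in it
        candidates = [current | {f} for f in features if f not in current]
        if not candidates:
            break
        current = max(candidates, key=merit)
        score = merit(current)
        if score > best_score:
            best_subset, best_score = current, score
            stale = 0
        else:
            stale += 1
    return set(best_subset)
```

With a toy merit that rewards features 1 and 2 and slightly penalizes any others, the search returns {1, 2} and stops once further expansions no longer improve the score.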
The adjusted Rand index, which is a clustering validation measure, has been used as a measure of correlation between the target class and a feature. Such metrics rank the features by making a comparison between the partition given by the target class and the partition of each feature [14].
General graph-theoretic clustering uses the concept of relative neighbourhood graphs [15,16]. Using the concept of neighbourliness, RNG(V), the relative neighbourhood graph, is defined as the representative of the family of graphs over a finite set of points V. Some clustering-based methods construct a graph using the k-nearest-neighbour approach [17]. Chameleon, a hierarchical clustering algorithm, uses a k-nearest-neighbour graph to construct a sparse graph. It then finds clusters by repeatedly combining subclusters. Chameleon is applicable only when each cluster contains a sufficiently large number of vertices (data items).
According to Xu et al. [18], the MST clustering algorithm has been widely used; they applied this approach to represent multidimensional gene expression data. MST-based clustering algorithms do not assume that the data points are separated by a regular geometric curve or grouped around centres.
Many graph-based clustering methods take advantage of the MST approach to represent a dataset. Zhong et al. [17] employed a two-round MST-based graph representation of a dataset and developed a separated clustering algorithm and a touching clustering algorithm, encapsulating both in the same method. Zahn [19] divided a dataset into different groups according to structural features (distance, density, etc.) and captured these distinctions with different techniques. Grygorash et al. [20] proposed two MST-based clustering algorithms. The first assumes that the number of clusters is predefined; it constructs an MST of a point set and uses an inconsistency measure to remove edges, repeating this process to form a hierarchy of clusters until k clusters are obtained. The second algorithm partitions the point set into a group of clusters by maximizing the overall standard deviation reduction. Shi and Malik [21] published a normalized cut criterion. Ding et al. [22] proposed a min-max clustering algorithm, in which the similarity or association between two subgraphs is minimized, while the similarity within each subgraph is maximized. It has been shown that the min-max cut leads to more balanced cuts than the ratio cut and the normalized cut [21].
Our algorithm uses Kruskal's algorithm, an MST-based technique, for clustering features. The clusters formed are not based on any distance or density measures. A feature is added to a cluster if it is not redundant with any of the features already present in it.
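A minimal sketch of this step, assuming the SU values have already been computed (the names `su_ff` for the feature-feature matrix and `su_fc` for the feature-class vector are illustrative, not from the paper):

```python
from itertools import combinations

def kruskal_mst(n, edges):
    """Kruskal's algorithm with a union-find over n vertices.

    edges is a list of (weight, i, j); returns the MST edge list.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst = []
    for w, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:          # edge joins two components: keep it
            parent[ri] = rj
            mst.append((w, i, j))
    return mst

def select_features(su_ff, su_fc):
    """su_ff[i][j]: SU between features i and j; su_fc[i]: SU(F_i, C)."""
    n = len(su_fc)
    edges = [(su_ff[i][j], i, j) for i, j in combinations(range(n), 2)]
    mst = kruskal_mst(n, edges)
    selected = set()
    for w, i, j in mst:
        # keep a pair only when its mutual correlation is below each
        # feature's correlation with the class (i.e. non-redundant)
        if w < su_fc[i] and w < su_fc[j]:
            selected.update((i, j))
    return sorted(selected)
```

For example, with three features whose class correlations are [0.5, 0.6, 0.1], only the MST edge whose weight is below both endpoint class correlations contributes features to the final subset.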
III. FEATURE SUBSET SELECTION METHOD
Feature selection is a preprocessing step in many applications. Subsets are generated and then evaluated on the basis of a heuristic. The main aim is to find the subset that contains the relevant, non-redundant features. In order to evaluate and classify features, we use the information-theoretic concepts of information gain and entropy. Entropy is a measure of the unpredictability of a random variable; it depends on the probabilities of the random variable rather than on its actual values. Entropy is defined as:

H(X) = −Σ_{x∈Z} p(x) log2 p(x) (1)

Here, X is a discrete random variable with the alphabet Z and the probability mass function p(x) = Pr{X = x}, x ∈ Z.

If X and Y are discrete random variables, Eqs. (2) and (3) give the entropy of Y before and after observing X:

H(Y) = −Σ_{y} p(y) log2 p(y) (2)

H(Y|X) = −Σ_{x∈Z} p(x) Σ_{y} p(y|x) log2 p(y|x) (3)

The amount by which the entropy of Y decreases reflects the additional information about Y provided by X and is called the information gain [23]. Information gain is a measure of the amount of information that one random variable contains about another. The information gain I(X, Y) is the relative entropy between the product distribution and the joint distribution [24]. Information gain is the amount of information obtained about X after observing Y, and is equal to the amount of information obtained about Y after observing X:

I(X, Y) = H(X) − H(X|Y) (4)

I(X, Y) = H(Y) − H(Y|X) (5)

Unfortunately, information gain is biased in favour of features with many outcomes: attributes with many diverse values will appear to gain more information than those with less diverse values, even if they are actually no more informative. Symmetric uncertainty [25] compensates for this bias of information gain toward attributes with more values and normalizes its value to the range [0, 1]. SU is defined as:

SU(X, Y) = 2 I(X, Y) / (H(X) + H(Y)) (6)

Sotoca and Pla [26] have used conditional mutual information as an information-theoretic measure. The conditional mutual information I(X; Y|Z) is defined with regard to two feature spaces X and Z over the same dataset and the relevant variable Y, representing the class labels. I(X; Y|Z) can be interpreted as how much information the feature space X can predict about the relevant variable Y that the feature space Z cannot [24].

> A. Outline
The procedure of removing irrelevant and redundant features is shown in Fig. 1.
The goal of obtaining relevant features is accomplished by defining an appropriate threshold for the SU value. If the attributes of the dataset meet this condition, they are considered relevant and kept. Subsequently, pairwise correlation is computed to remove redundant features, and a complete weighted graph is created. An MST is then constructed. If features meet the criterion defined by the threshold value, they are added to the cluster of the feature subset; otherwise, they are ignored.
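The flow outlined above hinges on estimating SU from data. The following is a minimal sketch using plug-in frequency estimates of the probabilities (illustrative code, not the authors' implementation):

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_x p(x) log2 p(x), estimated from a list of symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X), estimated from paired samples of X and Y."""
    n = len(x)
    total = 0.0
    for xv, count in Counter(x).items():
        ys = [yi for xi, yi in zip(x, y) if xi == xv]
        total += (count / n) * entropy(ys)
    return total

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    gain = hy - conditional_entropy(y, x)  # information gain
    denom = hx + hy
    return 0.0 if denom == 0 else 2.0 * gain / denom
```

A feature column whose SU with the class meets the chosen threshold θ would then be retained as relevant; a perfectly predictive feature scores 1, an independent one scores 0.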
IV. METHODOLOGY AND ANALYSIS
Our algorithm uses the concept of SU described above to solve the problem of feature selection. It is a two-step process: first, irrelevant features are removed; then, redundant features are removed from the dataset and the final feature subset is formed.
The irrelevant feature removal process evaluates whether or not a feature is relevant to the class. For each dataset in Table 1, the algorithm given in Fig. 2 is applied to determine the relevant features corresponding to different threshold values. The input to this algorithm is the dataset with m features, data = {F_0, F_1, F_2, ..., F_m}, the class C, and a threshold value. For each feature F_i, the information gain and the SU for the class, SU(F_i, C), are determined according to Eqs. (5) and (6). In order to determine the relevance of a feature, a user-defined threshold is applied to the SU value [4]. A subset FS of relevant features can be determined by a threshold θ on the SU value, such that F_i ∈ FS, 1 ≤ i ≤ h, if SU(F_i, C) ≥ θ. A relevant feature subset FS = {F_0', F_1', F_2', ..., F_h'} is formed, where h < m.

Here, for the input dataset D, the algorithm starts with the empty set and then adds features to it. It uses a heuristic, sequential search strategy for subset generation. Each subset generated is evaluated by SU. The search iterates and keeps all relevant features found as subsets if they meet the predefined threshold. The algorithm outputs a feature subset.

Time complexity: The time complexity of the method applied above is quite low. It is linear in the size of the data, as defined by the number of features and the number of instances in the dataset. The time complexity is O(m), where m is the number of features: each feature is visited once and its relevance is checked against the threshold value.

In order to remove redundant features from the subset obtained above, a graph-based clustering method is applied. The vertices of the graph are the relevant features, and the edges represent the correlation between two features. In order to compute the correlation between two features (f_i, f_j ∈ FS), the correlation SU(f_i, f_j) is calculated for every feature pair f_i and f_j, where i ≠ j. Therefore, the value of each edge is this correlation value, and the value of each node in the graph is the SU calculated above between that feature and the class label.

This results in a weighted complete graph with h(h−1)/2 edges (where h is the number of nodes), with the correlation values expressed as edge weights. In the case of high-dimensional data, where the number of vertices is very large, the complete graph is very dense. For further work on this dataset to be feasible, the number of edges needs to be reduced. An MST is constructed to make the graph less dense; the edges are removed using Kruskal's algorithm. In order to remove the redundant features from the spanning tree, only features with an edge weight less than the correlation of both features with the class are added to the cluster. The algorithm is described in Fig. 3.

V. EMPIRICAL STUDY
In this section, we evaluate the performance of the algorithm described above by testing it on eight datasets of varying dimensionality, ranging from low-dimensional to large-dimensional datasets. The ModifiedFAST algorithm was evaluated in terms of the percentage of selected features, the runtime, and the classification accuracy.
> A. Data Set
In order to validate the proposed algorithm, eight samples of real datasets with different dimensionality were chosen from the UCI machine learning repository and from featureselection.asu.edu/datasets. The datasets contain text, microarray, or image data. The details of these datasets are described in Table 1.
> B. Performance Parameters
Many feature selection algorithms are available in the literature. Some algorithms perform better than others with regard to individual metrics, but may perform less well from the viewpoint of other metrics.
The following classical performance metrics were used.
For each dataset, we obtained the number of features remaining after running the algorithm. The different feature selection algorithms were compared on the basis of the percentage of selected features. We then applied the naïve Bayes, tree-based C4.5, and instance-based IB1 classification algorithms to each newly generated feature subset to calculate the classification accuracy by 10-fold cross-validation.
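As an illustration of this evaluation protocol (not the paper's code), k-fold cross-validation of an IB1-style 1-nearest-neighbour classifier can be sketched with NumPy as follows:

```python
import numpy as np

def one_nn_predict(train_X, train_y, test_X):
    """IB1-style 1-nearest-neighbour prediction (Euclidean distance)."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)
        preds.append(train_y[np.argmin(d)])
    return np.array(preds)

def kfold_accuracy(X, y, k=10, seed=0):
    """Mean 1-NN accuracy over k cross-validation folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))      # shuffle before splitting
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        preds = one_nn_predict(X[train], y[train], X[test])
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))
```

The same loop would be repeated with each classifier and each selected feature subset; in the paper's setting, X would contain only the columns retained by the feature selection step.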
The main purpose of developing a feature subset for the classification task is to improve the performance of classification algorithms, and accuracy is the main metric for measuring classification performance. However, the number of selected features is also important in some cases.
We have developed a new performance evaluation measure: a formula that integrates the number of selected features with the classification accuracy of an algorithm.
Δ denotes the classification accuracy divided by the number of selected features. The number of selected features is a negative metric and the classification accuracy is a positive metric, such that values of Δ > 1 indicate a better algorithm.

> C. Statistical Significance Validation
Performance metrics are necessary to make observations about different classifiers. The purpose of statistical significance testing is to collect evidence representing the general behaviour of the classifiers from the results obtained by an evaluation metric. One of the tests used is the Friedman test.
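As a sketch of such a test (the statistic M in Eq. (7) of the paper may be defined slightly differently), the standard Friedman chi-square statistic over a d × k score matrix, together with the Nemenyi critical difference used in the post-hoc comparison, can be computed as:

```python
import numpy as np

def friedman_statistic(scores):
    """Standard Friedman chi-square for a (d datasets x k algorithms)
    score matrix; smaller scores receive better (lower) ranks.

    Simplifying assumption for this sketch: no ties within a row.
    """
    d, k = scores.shape
    # rank within each dataset: 1 = best (smallest value)
    ranks = np.argsort(np.argsort(scores, axis=1), axis=1) + 1
    mean_ranks = ranks.mean(axis=0)
    chi2 = 12.0 * d / (k * (k + 1)) * np.sum((mean_ranks - (k + 1) / 2.0) ** 2)
    return chi2, mean_ranks

def nemenyi_cd(k, n, q_alpha):
    """Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
```

Two algorithms are then judged to perform differently when their mean ranks differ by at least the CD; q_alpha must be looked up in a Studentized range table for the chosen significance level.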
The Friedman test [28] is a nonparametric approach. It can be used to compare the ranks of k algorithms over d datasets. It provides a test of significance for data with ranks < 6. If the value of k is > 5, then the level of significance, or the rank of the algorithm, can be read from the chi-square distribution table. The data are treated as a matrix {x_ij}_{d×k}, where d is the number of datasets (called blocks) and k is the number of columns, corresponding to the different algorithms.

The null hypothesis of the Friedman test considered here is that there is no difference between the feature selection algorithms based on the percentage of selected features. The decision rule then states that the null hypothesis should be rejected if M is greater than the critical value. If this hypothesis is rejected, a post-hoc test is needed to compare the performance of the algorithms. The Nemenyi test or the Bonferroni-Dunn test can be used as the post-hoc test.

VI. RESULTS AND ANALYSIS
We used several threshold values in our experiment. Different feature selection algorithms, such as FCBF [12], Relief, CFS [11], and FAST [13], have likewise used different threshold values to evaluate their results. Yu and Liu [12] have suggested setting the SU threshold to the value of the ⌊m/log m⌋th ranked feature for each dataset. In FAST [13], features are ranked and the SU value of the (√m * lg m)th ranked feature is used as the threshold, where m is the number of features. Table 2 shows the relevant features found after applying the filter feature selection algorithm described in Fig. 2 with a range of threshold values.

The algorithm given in Fig. 3 was then applied to all of the features listed in Table 2 for each dataset, and the final feature subset was recorded. Table 3 shows the percentages of selected features corresponding to the different thresholds. In order to choose the best threshold for this algorithm, the percentage of selected features, the runtime, and the classification accuracy for each threshold value were observed. Tables 4-6 record the classification accuracy, along with the calculations of the proposed performance parameter Δ for each dataset, using the naïve Bayes classifier, the tree-based C4.5 classifier, and the instance-based IB1 classifier, respectively.

Tables 2-6 show that different threshold values result in different numbers of selected features for the same dataset. In Tables 4-6, not all values for each threshold are displayed, due to space considerations. We observed experimentally that a threshold value of 0.2 yields optimal results. We also found that, after feature selection, all three classifiers achieve good classification accuracy on text datasets.
The datasets were categorized as low-dimension (D < 200) and large-dimension (D > 200). Tables 7 and 8 compare the results obtained with a threshold value of 0.2 using this algorithm with the results obtained using other existing feature selection algorithms.

When the different algorithms are evaluated in terms of the percentage of selected features, we conclude that the ModifiedFAST feature selection algorithm selects the smallest percentage of features in most of the datasets, as shown in Fig. 4. ModifiedFAST yielded better results than the other algorithms on many of the datasets. The FAST algorithm took second place, followed by FCBF, CFS, and ReliefF. ModifiedFAST had the highest performance on the Chess, Coil2000, WarpPIE10P, and WarpAR10P datasets. The proportion of selected features was improved in all of these datasets. This shows that the optimal threshold value of 0.2 greatly improves the performance of the classification algorithms.

In order to further rank the algorithms on the basis of statistical significance, the Friedman test was used to compare the k algorithms over the d datasets by ranking them. The value of M given in Eq. (7) was calculated and compared with the critical value at α = 1%. The test results falsified the null hypothesis, showing that the feature selection algorithms performed differently in terms of the percentage of selected features.
We then applied the Nemenyi test as a post-hoc test [29] to explore the actual range by which pairs of algorithms differed from each other. According to the Nemenyi test, two classifiers perform differently if their average ranks R_x and R_y (for algorithms x and y, respectively) differ by at least the critical difference (CD):

CD = q_α √(k(k+1)/(6N)) (8)

In Eq. (8), k is the number of algorithms, N is the number of datasets, and q_α is based on the Studentized range statistic divided by √2. Fig. 5 shows the results for these five algorithms with α = 0.1 on seven datasets. The CD value is compared with the mean rank of each algorithm using ModifiedFAST. Overall, we found that the rank of ModifiedFAST was slightly higher than that of FAST, and much higher than those of the other algorithms.

VII. CONCLUSION
This paper finds an optimal threshold value that can be used for most feature selection algorithms, resulting in good subsets. The thresholds given by earlier feature selection algorithms are not optimal and need to be changed. We observed that different threshold values can result in different numbers of features selected and a different level of accuracy due to changes in the number of selected features. The best threshold found for the proposed algorithm was 0.2.
We have compared the performance of the ModifiedFAST algorithm, based on the percentage of selected features and the classification accuracy, with the values given by other feature selection algorithms, such as FAST, ReliefF, CFS, and FCBF. For text datasets, we found that the proposed algorithm improved the classification accuracy, for all of the classifiers, relative to the accuracy before feature selection. On the basis of the proposed performance parameter Δ, we observed that all values were positive for all three classifiers, demonstrating that ModifiedFAST is a better algorithm as assessed by a combination of classification accuracy and the number of selected features.


[Fig. 1.] Flowchart of feature selection [27].

[Table 1.] Summary of data sets

[Fig. 2.] ModifiedFAST algorithm for removing irrelevant features.

[Fig. 3.] ModifiedFAST algorithm for redundant feature removal.


[Table 2.] The number of relevant features found using different threshold values

[Table 3.] Percentage of selected features using different threshold values

[Table 4.] Accuracy calculated using the naive Bayes classifier

[Table 5.] Accuracy calculated using the C4.5 classifier

[Table 6.] Accuracy calculated using the IB1 classifier

[Table 7.] Comparison of feature selection algorithms for lowdimensional datasets

[Table 8.] Comparison of feature selection algorithms for largedimensional datasets

[Fig. 4.] Comparison of five feature selection algorithms based on the percentage of features selected.


[Fig. 5.] Ranking of feature selection algorithms based on the Nemenyi test.