Hybrid Feature Selection Using Genetic Algorithm and Information Theory
 Author: Cho Jae Hoon, Lee DaeJong, Park JinIl, Chun MyungGeun
 Publish: International Journal of Fuzzy Logic and Intelligent Systems, Vol. 13, No. 1, pp. 73-82, 25 March 2013

ABSTRACT
In pattern classification, feature selection is an important factor in the performance of classifiers. In particular, when classifying a large number of features or variables, the accuracy and computational time of the classifier can be improved by using the relevant feature subset to remove irrelevant, redundant, or noisy data. The proposed method consists of two parts: a wrapper part with an improved genetic algorithm (GA) using a new reproduction method, and a filter part using mutual information. We also considered feature selection methods based on mutual information (MI) to improve computational complexity. Experimental results show that this method achieves better performance in pattern recognition problems than other conventional methods.

KEYWORD
Pattern classification, Feature selection, Mutual information, Computational complexity

1. Introduction
Feature selection algorithms can be categorized based on subset generation and subset evaluation [1]. Subset generation is a search procedure that selects candidate feature subsets based on certain search strategies, such as complete search, sequential search, and random search. Subset evaluation is a set of evaluation criteria used to evaluate a selected feature subset. The criteria can be categorized into two groups based on their dependency on inductive algorithms: independent criteria and dependent criteria. Some independent criteria include distance measures, information measures, dependency measures, and consistency measures [2-5]. A dependent criterion requires a predetermined inductive algorithm in feature selection: based on the performance of the inductive algorithm applied to the selected subset, it determines which features are selected. Under these evaluation criteria, algorithms are categorized into filter, wrapper, and hybrid methods. Filter methods are independent of the inductive algorithm and evaluate the performance of a feature subset by using the intrinsic characteristics of the data. In filter methods, the optimal feature subset is selected in one pass by evaluating some predefined criteria. Therefore, filter methods can quickly process very high-dimensional datasets; however, they also have the worst classification performance, because they ignore the effect of the selected feature subset on the performance of the inductive algorithm. Wrapper methods utilize the error rate of the inductive algorithm as the evaluation function. They search for the best subset of features among all available feature subsets. Wrapper methods are generally known to perform better than filter methods.

Information theory has been applied to feature selection problems in recent years. Battiti [6] proposed a feature selection method called mutual information feature selection (MIFS).
Kwak and Choi [7] investigated the limitation of MIFS using a simple example and proposed an algorithm that overcomes the limitation and improves performance. The main advantage of mutual information methods is their robustness to noise and data transformations. Despite these advantages, the drawback of feature selection methods based on mutual information is their slow computational speed, due to the computation of a high-dimensional covariance matrix. In pattern recognition, feature selection methods have been applied to various classifiers. Mao [8] proposed a feature selection method based on pruning and the support vector machine (SVM), and Hsu et al. [9] proposed a method called artificial neural net input gain measurement approximation (ANNIGMA) based on the weights of neural networks. Pal and Chintalapudi [10] proposed an advanced online feature selection method that selects the relevant features during the learning time of neural networks.
On the other hand, evolutionary computation techniques, such as genetic algorithms and genetic programming, have been applied to feature selection to find the optimal feature subset. Siedlecki and Sklansky [11] used a GA-based branch-and-bound technique. Pal et al. [12] proposed a new genetic operator called self-crossover for feature selection. In genetic algorithm (GA)-based feature selection techniques, each chromosomal gene represents a feature, and each individual represents a feature subset. If the i-th gene of the chromosome equals 1, then the i-th feature is selected as one of the features used to evaluate the fitness function; if the gene is 0, then the corresponding feature is not selected. Kudo and Sklansky [13] compared GA-based feature selection with many conventional feature selection methods and showed that GA-based feature selection performs better than the others for high-dimensional datasets.

In this paper, we propose a feature selection method using both information theory and a genetic algorithm. We also considered the performance of each mutual information (MI)-based feature selection method in order to choose the best MI-based method to combine with the genetic algorithm. The proposed method consists of two parts: a filter part and a wrapper part. In the filter part, we evaluate the significance of each feature using mutual information and then remove features with low significance. In the wrapper part, we use a genetic algorithm to select optimal feature subsets with smaller sizes and higher classification performance, which is the goal of the proposed method. To estimate the performance of the proposed method, we applied it to University of California, Irvine (UCI) machine-learning datasets [14]. Experimental results showed that our method is effective and efficient in finding small subsets of significant features for reliable classification.
2. Mutual Information-Based Feature Selection
2.1 Entropy and Mutual Information
Entropy and mutual information were introduced in Shannon's information theory to measure the information of random variables [15]. Mutual information is a special case of a more general quantity called relative entropy, which measures the distance between two probability distributions. Entropy is a measure of the uncertainty of a random variable. More specifically, if a discrete random variable X takes values in an alphabet λ with its probability density function denoted as p(x) = Pr{X = x}, x ∈ λ, then the entropy of X can be defined as

H(X) = -\sum_{x \in \lambda} p(x) \log p(x).  (1)

The joint entropy of two discrete random variables X and Y is defined as

H(X, Y) = -\sum_{x \in \lambda} \sum_{y \in \lambda} p(x, y) \log p(x, y),  (2)

where p(x, y) denotes the joint probability density function of X and Y. When some variables are known and the others are not, the remaining uncertainty is described by the conditional entropy, defined as

H(Y | X) = -\sum_{x \in \lambda} \sum_{y \in \lambda} p(x, y) \log p(y | x).  (3)

The common information of two random variables X and Y is defined as the mutual information between them:

I(X; Y) = \sum_{x \in \lambda} \sum_{y \in \lambda} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.  (4)

A large amount of mutual information between two random variables means that the two variables are closely related; if the mutual information is zero, the two variables are totally unrelated or independent of each other. The relation between the mutual information and the entropy is described by

I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X) = H(X) + H(Y) - H(X, Y),  (5)

which is also illustrated in Figure 1.
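These quantities can be estimated directly from empirical counts. The following is a minimal, stdlib-only sketch (the function names are illustrative, not from the paper):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical Shannon entropy H(X) in bits, from a sample of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """Empirical I(X;Y) via the identity I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Perfectly dependent variables share all their information ...
x = [0, 0, 1, 1]
assert abs(mutual_information(x, x) - entropy(x)) < 1e-12
# ... while independent ones share none.
y = [0, 1, 0, 1]
assert abs(mutual_information(x, y)) < 1e-12
```

Histogram-based estimates such as this are exactly what the paper relies on in Section 2.2.2, where the cost of the required joint histograms becomes the motivation for MIFS.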
In feature selection problems, the mutual information between a feature F and a class C is defined in terms of their probability density functions p(f), p(c), and p(f, c):

I(F; C) = \sum_{f} \sum_{c} p(f, c) \log \frac{p(f, c)}{p(f)\, p(c)}.  (6)

If the mutual information I(F; C) between feature F and class C is large, feature F contains much information about class C. If I(F; C) is small, feature F has little effect on the output class C. Therefore, in feature selection problems, the optimal feature subset can be determined by selecting the features with higher mutual information.

2.2 Mutual Information-Based Feature Selection
2.2.1 Feature Selection Problem with Mutual Information
Feature selection is a process that selects a subset from the complete set of original features. It selects the feature subset that can best improve the performance of a classifier or an inductive algorithm. In the process of selecting features, the number of features is reduced by excluding irrelevant or redundant features from the ones extracted from the raw data. This concept is formalized as selecting the most relevant k features from a set of n features. Battiti [6] named this concept a "feature reduction" problem. Let the FR(n, k) problem be defined as follows: given an initial set of n features, find the subset with k < n features that is "maximally informative" about the class.

In information theory, the mutual information between two random variables measures the amount of information commonly found in these variables. The problem of selecting input features that contain the relevant information about the output can be solved by computing the mutual information between input features and output classes. If the mutual information between input features and output classes could be obtained accurately, the FR(n, k) problem could be reformulated as follows: given an initial set F with n features and a set C of all output classes, find the subset S ⊂ F with k features that minimizes H(C | S) and maximizes the mutual information I(C; S).

To solve this FR(n, k) problem, we can use three strategies: complete search, random search, and sequential search. Complete search is guaranteed to find the optimal feature subset according to an evaluation criterion, because it evaluates all possible subsets; however, it is usually impractical due to the large number of combinations. Random search starts with a randomly selected subset and proceeds in two different ways: one follows a sequential search that injects randomness into the sequential approach, and the other generates the next subset in a completely random manner. The use of randomness helps to escape local optima in the search space. Sequential search gives up completeness and therefore risks losing optimal subsets. Many variations of the greedy algorithm are available for sequential search, including sequential forward selection and sequential backward elimination. All these approaches add or remove features one at a time. Algorithms with sequential search are simple to implement and quickly produce results, because the order of the search space is usually O(N^2) or less.

Typically, mutual information-based feature selection is performed by sequential forward selection. This method starts with an empty set of selected features, and the best available input feature is added to the selected set one by one until the size of the set reaches k. This ideal sequential forward selection algorithm using mutual information is implemented as follows:

1. (Initialization) Set F ← "initial set of n features," S ← "empty set."
2. (Computation of the MI with the output class) For each f_i ∈ F, compute I(C; f_i).
3. (Selection of the first feature) Find the feature f_i that maximizes I(C; f_i), and set F ← F \ {f_i}, S ← {f_i}.
4. (Greedy sequential forward selection) Repeat until the desired number of selected features is reached:
5. (Computation of the joint MI between variables) For each f_i ∈ F, compute I(C; f_i, S).
6. (Selection of the next feature) Choose the feature f_i ∈ F that maximizes I(C; f_i, S), and set F ← F \ {f_i}, S ← S ∪ {f_i}.

The output is the set S containing the selected features.
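The steps above can be sketched in a few lines of stdlib Python, estimating the joint mutual information from empirical counts on small discrete data (helper names are illustrative, not from the paper):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical Shannon entropy (in bits) of a sample of discrete values."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def joint_mi(c, columns):
    """I(C; X) where X stacks the given feature columns into one joint variable."""
    x = list(zip(*columns))
    return entropy(c) + entropy(x) - entropy(list(zip(c, x)))

def forward_select(features, c, k):
    """Greedy steps 1-6: repeatedly add the feature maximizing I(C; f_i, S)."""
    remaining = list(range(len(features)))
    selected = []
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda i: joint_mi(
            c, [features[j] for j in selected + [i]]))
        remaining.remove(best)
        selected.append(best)
    return selected

# f0 determines the class, f1 is noise, f2 merely duplicates f0's information.
features = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
subset = forward_select(features, [0, 0, 1, 1], k=2)
```

This is the "ideal" algorithm: it needs a joint histogram over C, f_i, and all of S, which is exactly the cost that motivates Battiti's MIFS approximation described next.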
In the steps above, C is a class, f_i is the i-th feature, and S is a feature subset.

2.2.2 Battiti's Mutual Information-Based Feature Selection
In the ideal sequential forward selection, we must estimate the joint mutual information between variables, I(C; f_i, S), and computing I(C; f_i, S) requires the probability density functions of the variables. However, it is difficult to compute probability density functions of high-dimensional data; therefore, a histogram of the data is used. In selecting k features, if the output classes consist of K_c classes and we divide the j-th input feature space into P_j partitions to obtain the histogram, we must have

K_c \prod_{j=1}^{k} P_j  (7)

cells to compute I(C; f_i, S). Because of this requirement, implementing the ideal sequential forward selection algorithm is practically impossible. To overcome this problem, Battiti [6] used only I(C; f_i) and I(f_i; f_s) instead of calculating I(C; f_i, S). The mutual information I(C; f_i) indicates the relative importance of the input feature f_i, estimated from the mutual information between the input feature f_i and the output class C. I(f_i; f_s) indicates the redundancy between the input feature f_i and an already-selected feature f_s. Battiti's algorithm, also known as MIFS, is essentially the same as the ideal greedy sequential forward selection algorithm, except that Step 4 is replaced in MIFS as follows [6]:

4. (Greedy sequential forward selection) Repeat until the desired number of selected features is reached:
a) (Computation of the MI between variables) For all pairs of variables (f_i, f_s) with f_i ∈ F and f_s ∈ S, compute I(f_i; f_s) if it is not yet available.
b) (Selection of the next feature) Choose the feature f_i ∈ F that maximizes I(C; f_i) − β Σ_{f_s ∈ S} I(f_i; f_s), and set F ← F \ {f_i}, S ← S ∪ {f_i}.

Here, β regulates the influence of the redundancy of the input feature f_i. If β = 0, the redundancy among features is not taken into consideration, and the algorithm selects features in the order of relative importance, estimated by the mutual information between input features and output classes.

3. Genetic Algorithm-Based Feature Selection
Genetic algorithm is one of the best-known techniques for solving optimization problems. It is a search method based on a random population. First, the genetic algorithm randomly encodes the initial population, which is a set of created individuals. Each individual can be represented as a bit string, and the strings can be constructed using all possible permutations in a potential solution space. At each step, the new population is determined by manipulating the chromosomes of the old population in order to obtain the best fitness in a given situation. This sequence continues until a termination criterion is reached. The chromosome manipulation is performed using one of three genetic operators: crossover, mutation, and reproduction. The selection step determines which individuals participate in the reproduction phase. Crossover allows the exchange of already existing genes, whereas mutation introduces new genetic material; substitution then defines the individuals of the next population. This process efficiently provides optimal or near-optimal solutions.
In genetic algorithm-based feature selection, the size of the chromosome n equals the total number of features N, and a gene represents a feature, with the values "1" and "0" meaning selected and removed, respectively. Therefore, genetic algorithm-based feature selection can be defined as finding the optimal feature subset with the smallest number of genes set to "1" and the highest classification performance. A generational GA procedure is shown below:
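The referenced listing did not survive formatting. A minimal generational-GA sketch over bit-string chromosomes, assuming roulette-wheel selection, one-point crossover, and bit-flip mutation (parameter values and function names are illustrative only):

```python
import random

def generational_ga(fitness, n_bits, pop_size=20, p_cross=0.8,
                    p_mut=0.02, n_generations=30, seed=0):
    """Hedged sketch of a generational GA: evaluate, select, recombine, mutate."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(n_generations):
        scores = [fitness(ind) for ind in pop]
        total = sum(scores)

        def select():
            # Roulette-wheel selection proportional to fitness.
            r, acc = rng.uniform(0, total), 0.0
            for ind, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return ind
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:
            a, b = select()[:], select()[:]
            if rng.random() < p_cross:          # one-point crossover
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for ind in (a, b):                  # bit-flip mutation
                for i in range(n_bits):
                    if rng.random() < p_mut:
                        ind[i] ^= 1
                nxt.append(ind)
        pop = nxt[:pop_size]                    # substitution: next generation
    return max(pop, key=fitness)

# Smoke test on OneMax: maximize the number of 1-bits in the chromosome.
best = generational_ga(sum, n_bits=12)
```

In the feature selection setting of Section 4, the fitness argument would instead score a chromosome by classification accuracy and subset size.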
4. Proposed Method by Mutual Information and Genetic Algorithm
The proposed method can be divided into two parts. The filter part is a preprocessing step to remove irrelevant, redundant, and noisy features according to their ranking by mutual information. In the wrapper part, the optimal feature subset is selected from the preprocessed features using a genetic algorithm whose fitness function is based on the number of features and on the classification accuracy. Figure 2 shows the structure of the proposed method, described as follows.
4.1 Filter Part by Mutual Information
[Step 1]
Evaluate mutual information. We determine the mutual information between each feature F and the class C by (6).

[Step 2]
Select the top-ranked features. Each feature is ranked using the evaluated mutual information. Then, we select the top-ranked features with higher mutual information to use as candidate individuals for the genetic algorithm. To determine the number of top-ranked features, we categorize the features into three types: full top-ranked, half top-ranked, and U top-ranked. Full and half top-ranked can be used on data with a small feature size, and U top-ranked can be used on data with a high feature size, where U is the selection rate determined by the user.

4.2 Wrapper Part by Genetic Algorithm
[Step 3]
Set initial parameters of GA and generate the initial population. Set the initial parameters of GA, such as population size, probabilities of crossover and mutation, number of generations, weight of fitness, and initial population rate. We also generate the initial population from the features selected in Step 2. In genetic algorithm-based feature selection methods, each chromosome is represented by an n-bit binary string for an n-dimensional feature space {b_1, b_2, ..., b_n}. If the i-th feature is present in the feature subset represented by the chromosome, then b_i = 1; otherwise, b_i = 0. An example of a simple procedure is as follows:
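The example listing was lost in formatting. One plausible sketch consistent with the description of |P|, random_number, and α below (setting each bit with probability α/n, so a chromosome selects α features on average; the helper name is hypothetical):

```python
import random

def initial_population(pop_size, n_features, alpha, seed=0):
    """Generate pop_size binary chromosomes of length n_features.

    Each bit b_i is set to 1 with probability alpha / n_features, so the
    expected number of selected features per chromosome is alpha. This is a
    plausible reading of the lost listing, not the paper's exact procedure.
    """
    rng = random.Random(seed)
    p_one = alpha / n_features
    return [[1 if rng.random() < p_one else 0 for _ in range(n_features)]
            for _ in range(pop_size)]

pop = initial_population(pop_size=30, n_features=40, alpha=10)
```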
Here, |P| is the population size, random_number is a function that generates a random floating-point number within [0, 1], and α is the expected number of selected features. Figure 3 shows the structure of an n-dimensional binary chromosome.

[Step 4]
Evaluate the fitness function of the initial generation. Evaluate the fitness values of all individuals in the population, using a classifier to score each chromosome (the selected feature subset) based on the classification accuracy and the dimension of the feature subset. Tan et al. [14] proposed the following fitness function in order to optimize two objectives, maximizing the classification accuracy of the feature subset and minimizing its size:

fitness(z) = λ · acc(z) − (1 − λ) · feats(z),  (8)

where z is a feature vector representing a selected feature subset, and λ is a weight value between 0 and 1. The function is composed of two parts: the first part is the weighted classification accuracy acc(z) from a classifier, and the second part is the weighted size feats(z) of the feature subset represented by z. In order to obtain a balance between accuracy and dimensionality reduction, the following fitness function is proposed:

fitness(z) = λ · acc(z) + (1 − λ) · (1 − feats(z) / total_feat),  (9)

where total_feat is the maximum number of features for the problem. In (8) and (9), in order to obtain the classifier with the best accuracy and the smallest size, λ is set to 0.5.

[Step 5]
Perform the selection, crossover, and mutation steps. To produce the new feature subsets, the GA carries out three operators: selection, crossover, and mutation. The selection operator selects new feature subsets based on the fitness value of each subset, and the crossover and mutation operators then create the next-generation feature subsets.

[Step 6]
Evaluate the fitness function of the next generation.

[Step 7]
Perform the termination test. If the predefined number of generations is reached, stop the algorithm; otherwise, go to Step 5. In this study, the termination condition was based only on the number of generations.

5. Experiments and Discussions
A 10-fold cross-validation procedure was used to evaluate the performance of the k-nearest neighbor algorithm (k-NN) with k = 1. In 10-fold cross-validation, the data with the selected feature subset are partitioned into ten folds. To test the classifier, one fold is retained as the validation data, and the remaining nine folds are used as training data. The cross-validation process is repeated 10 times, and the 10 results are averaged to produce a single estimate. In the filter part, we used the 0.7 top-ranked features (U = 0.7). Table 1 shows the parameters for the genetic algorithm, and Table 2 shows the UCI datasets used in this experiment. The first two rows are artificial datasets, and the rest are real-world datasets. In this experiment, we used the fitness values evaluated by (9) and three genetic operators: roulette-wheel selection, uniform crossover, and simple mutation.
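As an illustration of this evaluation pipeline, the following stdlib-only sketch scores a candidate feature subset with 1-NN under 10-fold cross-validation, assuming the fitness takes the convex-combination form of accuracy and dimensionality reduction described in Step 4 (function names and the fold scheme are illustrative, not the paper's exact implementation):

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn1_predict(train, labels, query):
    """1-nearest-neighbor prediction by Euclidean distance."""
    i = min(range(len(train)), key=lambda j: dist(train[j], query))
    return labels[i]

def cv10_accuracy(X, y, subset):
    """10-fold cross-validated accuracy of 1-NN on the chosen feature subset."""
    Xs = [[row[i] for i in subset] for row in X]
    folds = [list(range(f, len(Xs), 10)) for f in range(10)]
    correct = 0
    for fold in folds:
        train_idx = [i for i in range(len(Xs)) if i not in fold]
        for i in fold:
            pred = knn1_predict([Xs[j] for j in train_idx],
                                [y[j] for j in train_idx], Xs[i])
            correct += pred == y[i]
    return correct / len(Xs)

def fitness(X, y, subset, total_feat, lam=0.5):
    """Assumed reading of (9): trade accuracy against relative subset size."""
    return lam * cv10_accuracy(X, y, subset) + \
           (1 - lam) * (1 - len(subset) / total_feat)

# Toy data: feature 0 separates the classes perfectly; feature 1 is constant.
X = [[float(i), 0.0] for i in range(10)] + [[float(i) + 100.0, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10
score = fitness(X, y, subset=[0], total_feat=2)
```

In the wrapper part, this fitness would be evaluated once per chromosome and generation, which is why the filter part's pre-pruning of features matters for overall runtime.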
Figures 4 and 5 show, for each generation, the fitness values of the GA and the number of selected features. We can see that the optimal feature subset was effectively found by the proposed method.
In Table 3, all the results of the other methods are obtained from [9]. The NN column lists the results with no feature subset selection. The second column shows the results of the standard wrapper backward-elimination method. The third column shows the results of the ANNIGMA-wrapper method proposed by Hsu et al. The fourth column represents the results of GA, and the final column shows the results of the proposed method. For each error rate and each number of selected features, we report the average and the standard deviation. As shown in Table 3, the proposed method performs better than the other methods for most datasets while using a small number of features. More specifically, the error rate is 4.2% when using the eight dominant features chosen by the proposed method, whereas the error rate is 11.4% for NN without feature selection. From this result, one can see that the proposed method makes it possible to dramatically decrease the error.
6. Conclusion
Feature selection methods can be divided into two groups, filter methods and wrapper methods, based on their independence from or dependence on the inductive algorithm. Filter methods are computationally fast because the optimal feature subset is selected in one pass by evaluating some predefined criteria. However, they have the worst classification performance, because they ignore the effect of the selected feature subset on the performance of the inductive algorithm. Wrapper methods achieve higher performance than filter methods, but they have a high computational cost.
In order to overcome the drawbacks of both filter methods and wrapper methods, we proposed a feature selection method using both information theory and a genetic algorithm. The proposed method was applied to UCI datasets and some gene expression datasets. For the various experimental datasets, the proposed method had better generalization performance than previous ones. More specifically, the error rate is 4.2% when using the eight dominant features chosen by the proposed method, whereas the error rate is 11.4% for NN without feature selection. From these results, one can see that the proposed method makes it possible to dramatically decrease the error without increasing the computational time.
Conflict of Interest
No potential conflict of interest relevant to this article was reported.

[Figure 1.] Relation between entropy and mutual information.

[Figure 2.] Flowchart of the proposed method. GA, genetic algorithm; k-NN, k-nearest neighbor algorithm.

[Figure 3.] n-dimensional binary chromosome.

[Table 1.] Parameters for the genetic algorithm

[Table 2.] UCI datasets and classifiers used in this experiment

[Table 3.] Comparison results between the proposed and other methods

[Figure 4.] Fitness values of the genetic algorithm. (a) Breast cancer Wisconsin data, (b) credit data, (c) monk3b data, (d) ionosphere data.

[Figure 5.] The number of selected features in each generation. (a) Breast cancer Wisconsin data, (b) credit data, (c) monk3b data, (d) ionosphere data.