An Optimal Weighting Method in Supervised Learning of Linguistic Model for Text Classification
 Author: Mikawa Kenta, Ishida Takashi, Goto Masayuki
 Organization: Mikawa Kenta; Ishida Takashi; Goto Masayuki
 Publish: Industrial Engineering and Management Systems Volume 11, Issue1, p87~93, 01 March 2012

ABSTRACT
This paper discusses a new weighting method for text analyzing from the view point of supervised learning. The term frequency and inverse term frequency measure (
tfidf measure) is famous weighting method for information retrieval, and this method can be used for text analyzing either. However, it is an experimental weighting method for information retrieval whose effectiveness is not clarified from the theoretical viewpoints. Therefore, other effective weighting measure may be obtained for document classification problems. In this study, we propose the optimal weighting method for document classification problems from the view point of supervised learning. The proposed measure is more suitable for the text classification problem as used training data than thetfidf measure. The effectiveness of our proposal is clarified by simulation experiments for the text classification problems of newspaper article and the customer review which is posted on the web site.

KEYWORD
Text Classification , Weighting Method , Vector Space Model , Cosine Similarity

1. INTRODUCTION
Due to development of information technology, the effectiveness of knowledge discovery from enormous document data is suggested in much of the literatures on this subject (Hearst. 1999). There are many web sites where customers can post their free comments about merchandise that they bought and used. On the internet, the number of customer reviews is increasing day by day. Therefore it has been easy to get a large amount of document data and analyze it for several purposes. Customer reviews consist of not only free comments but customer information and the degree of satisfaction about items as metadata. The analysis using the metadata is more helpful for knowledge discovery than using only text data. The techniques for text mining are developed for the purpose of getting information. Various methods have been proposed in this research field, for example, vector space model (Manning
et al. , 2008), (Mikawaet al. , 2012), probabilistic model (Hofmann. 1999), (Bishop, 2006) and so on.In this paper, a vector space model is the focus for document analysis. To construct a vector space model for document analysis, the documents are separated into terms or words (morphemes) by using the morphological analysis (Nagata. 1994). After that, each document is represented by a vector whose elements express the information of word frequency of appearance. Because the vector space is built by the information of word frequency, the characteristics of a document vector model should be remarkable: high dimension and sparseness. Generally speaking, untold thousands of words or more should be treated to represent a document vector using effective words appearing in all documents.
As mentioned above, there are enormous words which are appeared in whole documents. In addition, term frequency of each word varies widely in length. Therefore, the performance of text analyzing depends on term frequency of words which is appeared each documents. That is, it depends on the length of documents. To avoid this, several weighting approach for each word has been proposed. For instance,
tf idf weighting (Saltonet al. , 1988), PWI (Probabilityweighted amount of information) (Aizawa, 2000, 2003), mutual information (McCallumet al. , 1998) and so on. Andtf idf weighting is one of the most famous method for weighting terms. However, it is proposed for information retrieval and the effectiveness is empirically shown. Therefore, the theoretical optimality is not proved. In addition, it doesn't use the metadata or side information for weighting each word. Nowadays, it can be easy to get or use those, and by using that information, it supposes to improve the performance of each analysis.From above discussion, the purpose of this study is to propose a new weighting method for each word from the view point of supervised learning. We show the way of estimating an optimal word weighting by solving maximization problems. The effectiveness of this method is clarified by case experiments of applications to the customer review which are posted on web sites and newspaper articles which are used as a bench mark data.
In section 2, basic formulation of vector space model and weighting methods which have already proposed are explained. In section 3, the proposed method of weighting each word and the way of its estimation is explained. The illustration of simulation experiments in order to clarify the effectiveness of our proposal and the results acquired from the experiments are explained in section 4. Finally, the conclusion of this study is stated in section 5.
2. BASIC INFORMATION FOR ANALYSIS OF TEXT DOCUMENTS
In this paper, the vector space model is adopted to represent the document data. In this section, the premises and the notations of this research are defined and relative methods are explained.
2.1 Similarity among Documents
Let the set of documents be Δ={
d _{1},d _{2} ,…,d _{D}}, and the set of categories to which each document in training data belongs beC ={c _{1},c _{2} ,…,c _{N}}. Here,D andN are the numbers of documents and categories respectively.All documents in Δ are separated into morphologies by the morphological analysis. Selecting valid words from the given morphologies by using frequencies, mutual information, or other methods, a word set is constructed to represent documents in the vector space. Let the word set from the all documents in Δ be Σ = {
w _{1},w _{2} ,…,w _{W}}Then each document can be expressed as a point in the vector space. Here,
W is the number of different valid words which appear in Δ and it is equivalent to the dimension of the vector space. Here, let the frequency of a wordw_{j} in the document bed _{i} and it can be expressed byv ^{i}_{j}W dimensional vector as follows:where,
T means transposition of a vector. That is, the document space can be constructed by regarding each component of vectors as the information about the frequency of each word.By expressing document as a vector, a similarity or distance metrics between documents can be measured in the vector space. Here, the distance metric of document vectors
andd_{i} can be expressed by using the Euclid distance which is traditionally used as follows:d_{j} However, the Euclid distance sometimes cannot work effectively to treat the document data. For example, many elements in a document vector are almost 0 and this property is called “sparseness.” If there are only a few different words between two documents, these documents are judged as having high similarity.
On the other hand, the cosine similarity measure between document vectors has asymptotically good performance in a sparse vector space (Goto
et al. , 2007), (Gotoet al. , 2008).2.2 Weighting Method for Documents
The way of weighting can be expressed the Hadamard product of document vector
and weighting vectord_{i} . Here, let weighting vectorf bef Here,
is weighting function for the wordf_{k} By using that, let the weighted document vector bew_{k} ^{*}d _{i} where, let
^{*}d _{i} = ⊙f is Hadamard product ofd_{i} andf .d_{i} There are several weighting methods for text data analyzing. As mentioned above, the most famous method is is
tf idf weighting. It can be calculated the product oftf (term frequency, that isv_{i}^{j} andidf (inverse document frequency). Here, thetf is representing the frequencies of each term. Andidf is a monotonically decrease function of an appearance rate of term in documents. Letdf (w_{k} ) be a number of documents in which the wordw_{k} appears. Then theidf weighting for the wordw_{k} is given byHere, let the weight function
bef _{k}idf (w_{k} ),tf idf weighting can be expressed as follows:3. THE METHOD OF SUPERVISED WEIGHTING FOR TEXT CLASSIFICATION
As mentioned above,
tf idf weighting is not made use of side information or metadata when calculating each word weight. In the basic text classification problem, training data has information which category training data belongs to. Therefore, by making the most use of the information for weighting each word can be improved its performance.In this section, we propose the optimal weighting method and its estimation for text classification. And we derive the expression of optimal weight is given by the solution of maximization problem.
To formulate the learning algorithm to estimate the optimal weighting from training data set, the centroid of each category is defined as equation (8). The centroid can be calculated by using the training data with known category. Let the centroid vector of category
c_{n} (n = 1, 2, …,N ) be = (g_{n} g _{1},g _{2}, …,g_{nW} )^{T} . Then the centroid vector is given byg_{n} Here, the 
c_{n}  is a number of documents contained categoryc_{n} . And the same as equation (5), weighted centroid of categoryc_{n} is given byNormally, training data should be allocated by near the centroid of its category because a new document data is classified by the distance from the centroids. Due to that, the optimization of the weight should maximize the similarity between training data and its centroid.
From the above discussion, the estimation of optimal weighting vector
is to maximize cosine similarity between weighted document vectorf ^{*}d _{i} and weighted centroid vector . If the weighting vector can be obtained by the maximization of the similarity between data and its category’s centroid, the proposed weighting method is expected to have a good performance to classify the data into proper categories. Then, our optimal weighting vectorg ^{*}_{n} is given byf Here,
= (f f _{1},f _{2}, …,f_{W} )^{T} is weighting vector, and it satisfiesEquation (10) can be expressed by matrix operation by using the diagonal matrix which is constructed its elements by (
f _{k} )^{2}. To use that, the optimal weight of each word can be estimated to maximize cosine similarity between training data and each category’s centroidd_{i} by using the matrix.g_{n} In order to solve the above problem, we introduce the diagonal metric matrix
M = [m_{k} ] (k =1, 2, …, W ) with the elements (f _{k} )^{2} that isHere,
M = [m_{k} ] is satisfying M  = 1, and M  is determinant ofM .From above discussion, the equation (10) can be rewritten as follows:
Supject to M = 1
From the above formulation, the following theorem is obtained.
Theorem 1: The metric matrixwhich is satisfies the equation (11) is given by
(For the proof, see APPENDIX.)
From above discussion, the similarity between document vector and centroid can be calculated by using the metric matrix
given by the equation (12) as follows:
After calculating similarity among test data and all centroids by equation (13), test data is classified to the most similar category.
4. EXPERIMENTS
4.1 Experimental Condition
In this section, simulation experiments are conducted to verify the effectiveness of our method by using Japanese document data in practice. The experiments are performed for the two data sets, i.e, customer reviews which are posted on a web site and newspaper articles with preassigned categories. The suitable weight is estimated by learning through these data. And the effectiveness is confirmed by the performance of classification of test data into the correct categories.
The basic process of experiments is as follows.
Step1: Calculation of all centroids from training data.Step2: From equation (12), learning metric matrixfrom training data.
Step3: From equation (13), to calculate similarity among test data and all category’s centroid. And it classifies that with most similar category.As mentioned above, two different types of data are used on this experiment. The first one is articles from the Mainichi newspaper in 2005 which are used as a benchmark data for document classification. The second one is customer review that is posted on a web site in order to recognize the performance of our method for real data. News articles in the Mainichi newspaper consist of several categories (economics, sports, politics and so on) and this time, three and five categories are extracted at random. Customer review consists of not only text data but degree of customer’s satisfaction as mentioned above. In this experiment, we use two categories of that which are highest degree of satisfaction and lowest one.
A condition of experiments is shown in Table 1.
For the sake of comparison, the experiment of the common cosine measure with only term frequency (
tf measure) andtf idf measure between different two data points was also performed. The criteria of evaluation are classification accuracy rate. It’s the ratio of the number of documents which are classified into correct categories to the total number of documents.4.2 Result of Experiments
The results of experiments are shown in Figure 1, Figure 2 and Figure 3. Figure 1 and Figure 2 show the cases of the newspaper article. The former is the case of three categories and the latter is that of five.
In addition, Figure 3 shows the case of customer review. It shows supervised weighting (proposed method),
tf idf weighting and only use term frequency (tf weighting).From the Figure 1 and Figure 2, the proposed method is basically superior to
tf idf weighting andtf weighting methods. However, from the Figure 3, the proposed method and the other two methods are almost the same performance. That is, the performance of proposed method is more suitable when newspaper article is used.4.3 Discussion
From the result, the proposed method is more suitable than conventional methods when newspaper article is used. In the proposed method, category information of training data as weighting parameter is taken into consideration. This property can work well for text classification. On the other hand,
tf idf weighting is not used each category’s characteristics. As the result, the performance is improved.However, the performance of the proposed method which can be solved regarded as the optimal solution is not improved drastically in the customer review classification. Because customer write their comment freely for merchandise which they bought and used. Due to the fact, it may appear many invalid words and the tendency of words is not different between each category. The customer reviews are too high of freedom of text data for the proposed method to estimate such complex statistical structure. By these reason, the proposed method doesn’t work effectively. It is necessary to improve the method as getting better performance for customer review (real data) as the future work.
5. CONCLUSION
In this paper, the suitable weighting method by estimating optimal weight from a training data set is proposed. Our proposal is based on supervised learning with optimal metric matrix
which can be calculated by a training data set.
From the simulation experiment, the proposed method is superior than conventional method when newspaper article is used.
A future work is to calculate the contribution ratio in each word which is weighted by the proposed method for the text classification in order to improve performance.
APPENDIX: Proof of Theorem 1
We show the proof of Theorem 1 as follows.
To solve for
M under the M  = 1 is equivalent to solve following maximization:In the following, to simplify the calculation, document
and centroidd^{*}_{i} is normalized by its norm. In other words, it can be said g^{*}_{n}  = d^{*}_{i}  = 1. ^{1)} So equation (14) becomesg^{*}_{n} Here,
M is diagonal matrix and M  =1. Therefore,Therefore,
is derived.
To maximize the equation (15) under the equation (16), Lagrange multipliers method is used. Here, let Lagrange multiplier be
λ , and Lagrange function can be definedHere,
L is partially differentiated with respect tom_{k} . And put them 0,is derived. To solve the equation (18) for
m_{k} ,is derived.
Here, let
M ^{?1} beThen form the equation (19),
is
Here, let set
A beA = [a_{kl} ] and seta_{k} beThen
isA Therefore,
is derived. From the equation (22),
A = λM ^{?1}, is acquired. By the characteristics ofM ,is derived. Therefore, λ becomes
Accordingly, form equations (19) and (25),
becomes
Here, because
A is diagonal matrix, A  is given byConsequently,
is derived.
Accordingly, from equations (26) and (28), each
becomes
The proof is complete. □
In case of d*i ≠ 1, g*n ≠ 1, it can be easy to extend.

[Table 1.] A Condition of Experiments.

[Figure 1.] Result for Classification Accuracy Using Newspaper (3 Categories).

[Figure 2.] Result for Classification Accuracy Using Newspaper (5 Categories).

[Figure 3.] Result for Classification Accuracy Using Customer Review.