An Optimal Weighting Method in Supervised Learning of Linguistic Model for Text Classification
- Author: Mikawa Kenta, Ishida Takashi, Goto Masayuki
- Published: Industrial Engineering and Management Systems, Volume 11, Issue 1, pp. 87-93, 1 March 2012
-
ABSTRACT
This paper discusses a new weighting method for text analysis from the viewpoint of supervised learning. The term frequency and inverse document frequency measure (tf-idf measure) is a famous weighting method for information retrieval, and it can also be used for text analysis. However, it is an empirical weighting method for information retrieval whose effectiveness has not been clarified from a theoretical viewpoint, so a more effective weighting measure may exist for document classification problems. In this study, we propose an optimal weighting method for document classification problems from the viewpoint of supervised learning. Because it makes use of the training data, the proposed measure is more suitable for the text classification problem than the tf-idf measure. The effectiveness of our proposal is clarified by simulation experiments on text classification problems with newspaper articles and with customer reviews posted on a web site.
-
KEYWORD
Text Classification, Weighting Method, Vector Space Model, Cosine Similarity
-
1. INTRODUCTION

Due to the development of information technology, the effectiveness of knowledge discovery from enormous document data has been demonstrated in much of the literature on this subject (Hearst, 1999). There are many web sites where customers can post free comments about merchandise that they bought and used, and the number of customer reviews on the internet increases day by day. It has therefore become easy to collect a large amount of document data and analyze it for several purposes. Customer reviews consist not only of free comments but also of customer information and the degree of satisfaction with items as metadata. Analysis using this metadata is more helpful for knowledge discovery than analysis using the text data alone. Techniques for text mining have been developed for the purpose of extracting such information, and various methods have been proposed in this research field, for example, the vector space model (Manning et al., 2008; Mikawa et al., 2012) and probabilistic models (Hofmann, 1999; Bishop, 2006).

In this paper, the vector space model is the focus for document analysis. To construct a vector space model for document analysis, the documents are separated into terms or words (morphemes) by morphological analysis (Nagata, 1994). After that, each document is represented by a vector whose elements express the frequency of appearance of each word. Because the vector space is built from word-frequency information, two characteristics of the document vector model are remarkable: high dimensionality and sparseness. Generally speaking, tens of thousands of words or more must be treated to represent a document vector using the effective words appearing in all documents.
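To make this construction concrete, the following is a minimal sketch of building word-frequency document vectors. Plain whitespace splitting stands in for morphological analysis (Japanese text would require a morphological analyzer such as MeCab), and the toy data and variable names are our own.

```python
from collections import Counter

# Toy corpus; whitespace splitting stands in for morphological analysis.
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks rose on strong earnings"]

# The word set over all documents, i.e., the axes of the vector space.
vocab = sorted({w for doc in docs for w in doc.split()})

def to_vector(doc):
    """Return the W-dimensional frequency vector (v_i1, ..., v_iW)."""
    counts = Counter(doc.split())
    return [counts.get(w, 0) for w in vocab]

vectors = [to_vector(doc) for doc in docs]  # mostly zeros: high-dimensional, sparse
```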
As mentioned above, an enormous number of words appear across the whole document set, and the frequency of each word varies widely with document length. The performance of text analysis therefore depends on the term frequencies observed in each document, that is, on the length of the documents. To avoid this, several weighting approaches for individual words have been proposed, for instance, tf-idf weighting (Salton et al., 1988), PWI (probability-weighted amount of information) (Aizawa, 2000, 2003), and mutual information (McCallum et al., 1998). Among these, tf-idf weighting is one of the most famous methods for weighting terms. However, it was proposed for information retrieval, and its effectiveness has only been shown empirically; its theoretical optimality has not been proved. In addition, it does not use metadata or side information for weighting each word. Nowadays such information is easy to obtain, and using it can be expected to improve the performance of each analysis.

From the above discussion, the purpose of this study is to propose a new weighting method for each word from the viewpoint of supervised learning. We show how to estimate an optimal word weighting by solving a maximization problem. The effectiveness of this method is clarified by case experiments on customer reviews posted on web sites and on newspaper articles used as benchmark data.
In section 2, the basic formulation of the vector space model and previously proposed weighting methods are explained. In section 3, the proposed method of weighting each word and the way of estimating it are explained. Section 4 describes the simulation experiments conducted to clarify the effectiveness of our proposal and the results acquired from them. Finally, the conclusion of this study is stated in section 5.
2. BASIC INFORMATION FOR ANALYSIS OF TEXT DOCUMENTS
In this paper, the vector space model is adopted to represent the document data. In this section, the premises and notations of this research are defined, and related methods are explained.
2.1 Similarity among Documents
Let the set of documents be $\Delta = \{d_1, d_2, \ldots, d_D\}$, and let the set of categories to which the documents in the training data belong be $C = \{c_1, c_2, \ldots, c_N\}$. Here, $D$ and $N$ are the numbers of documents and categories, respectively.

All documents in $\Delta$ are separated into morphemes by morphological analysis. Selecting valid words from the resulting morphemes by frequencies, mutual information, or other criteria, a word set is constructed to represent documents in the vector space. Let the word set over all documents in $\Delta$ be $\Sigma = \{w_1, w_2, \ldots, w_W\}$. Then each document can be expressed as a point in the vector space. Here, $W$ is the number of different valid words appearing in $\Delta$, which is equal to the dimension of the vector space. Let $v_{ij}$ be the frequency of word $w_j$ in document $d_i$; then the document can be expressed as the $W$-dimensional vector

$\boldsymbol{d}_i = (v_{i1}, v_{i2}, \ldots, v_{iW})^{T}$ (1)

where $T$ denotes transposition of a vector. That is, the document space is constructed by regarding each component of a vector as the frequency information of one word.

By expressing documents as vectors, a similarity or distance metric between documents can be measured in the vector space. The distance between document vectors $\boldsymbol{d}_i$ and $\boldsymbol{d}_j$ can be expressed by the traditionally used Euclidean distance:

$\mathrm{dist}(\boldsymbol{d}_i, \boldsymbol{d}_j) = \sqrt{\sum_{k=1}^{W} (v_{ik} - v_{jk})^{2}}$ (2)

However, the Euclidean distance sometimes does not work effectively for document data. Many elements of a document vector are 0, a property called "sparseness," so two documents that differ in only a few words of the whole vocabulary are judged as close to each other even when they share no words at all.
On the other hand, the cosine similarity measure between document vectors, given by

$\cos(\boldsymbol{d}_i, \boldsymbol{d}_j) = \dfrac{\boldsymbol{d}_i^{T} \boldsymbol{d}_j}{\|\boldsymbol{d}_i\| \, \|\boldsymbol{d}_j\|}$ (3)

has asymptotically good performance in a sparse vector space (Goto et al., 2007; Goto et al., 2008).
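The following toy sketch (our own illustrative data) contrasts equations (2) and (3) on sparse frequency vectors: Euclidean distance places a short document closer to an unrelated short document than to a longer document on the same topic, while cosine similarity orders them correctly.

```python
import numpy as np

# A short and a long document about the same topic, plus a short document
# with entirely different words.
short_doc = np.array([2., 1., 0., 0., 0., 0.])
long_doc  = np.array([20., 10., 0., 0., 0., 0.])   # same topic, 10x longer
other_doc = np.array([0., 0., 2., 1., 0., 0.])     # no shared words

def euclid(a, b):
    return np.linalg.norm(a - b)                              # equation (2)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # equation (3)

print(euclid(short_doc, long_doc), euclid(short_doc, other_doc))  # ~20.1 vs ~3.2
print(cosine(short_doc, long_doc), cosine(short_doc, other_doc))  # 1.0 vs 0.0
```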
2.2 Weighting Method for Documents

Weighting can be expressed as the Hadamard product of a document vector $\boldsymbol{d}_i$ and a weighting vector $\boldsymbol{f}$. Let the weighting vector be

$\boldsymbol{f} = (f_1, f_2, \ldots, f_W)^{T}$ (4)

where $f_k$ is the weight for word $w_k$. Using this, the weighted document vector $\boldsymbol{d}_i^{*}$ is defined by

$\boldsymbol{d}_i^{*} = \boldsymbol{d}_i \odot \boldsymbol{f}$ (5)

where $\odot$ denotes the Hadamard (element-wise) product of $\boldsymbol{d}_i$ and $\boldsymbol{f}$.

There are several weighting methods for text data analysis. As mentioned above, the most famous is tf-idf weighting. It is calculated as the product of tf (term frequency, that is, $v_{ij}$) and idf (inverse document frequency). Here, tf represents the frequency of each term, and idf is a monotonically decreasing function of the appearance rate of a term over documents. Let $df(w_k)$ be the number of documents in which the word $w_k$ appears. Then the idf weight for the word $w_k$ is given by

$idf(w_k) = \log \dfrac{D}{df(w_k)}$ (6)

Letting the weight function be $f_k = idf(w_k)$, tf-idf weighting can be expressed as follows:

$v_{ik}^{*} = v_{ik} \cdot idf(w_k)$ (7)
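As a concrete illustration, here is a minimal sketch of equations (6) and (7) on a toy term-frequency matrix; the data and variable names are our own.

```python
import numpy as np

# X[i, k] = v_ik, the frequency of word w_k in document d_i (toy data).
X = np.array([[3, 0, 1, 0],
              [0, 2, 1, 0],
              [1, 0, 0, 4]], dtype=float)
D = X.shape[0]                          # number of documents

df = (X > 0).sum(axis=0)                # df(w_k): documents containing w_k
idf = np.log(D / df)                    # equation (6)
X_weighted = X * idf                    # equation (7), applied row-wise
```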
3. THE METHOD OF SUPERVISED WEIGHTING FOR TEXT CLASSIFICATION

As mentioned above, tf-idf weighting does not make use of side information or metadata when calculating the weight of each word. In the basic text classification problem, however, each training document carries the information of which category it belongs to, and making the most of this information when weighting each word can improve classification performance. In this section, we propose an optimal weighting method for text classification and its estimation procedure, and we derive an expression for the optimal weight as the solution of a maximization problem.
To formulate the learning algorithm that estimates the optimal weighting from the training data set, the centroid of each category is defined as in equation (8); it can be calculated from the training data with known categories. Let the centroid vector of category $c_n$ ($n = 1, 2, \ldots, N$) be $\boldsymbol{g}_n = (g_{n1}, g_{n2}, \ldots, g_{nW})^{T}$. Then the centroid vector is given by

$\boldsymbol{g}_n = \dfrac{1}{|c_n|} \sum_{\boldsymbol{d}_i \in c_n} \boldsymbol{d}_i$ (8)

Here, $|c_n|$ is the number of documents contained in category $c_n$. In the same way as equation (5), the weighted centroid of category $c_n$ is given by

$\boldsymbol{g}_n^{*} = \boldsymbol{g}_n \odot \boldsymbol{f}$ (9)
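A short sketch of the centroid computation of equations (8) and (9), with toy data and a placeholder uniform weighting vector of our own choosing:

```python
import numpy as np

# Toy frequency vectors and category labels.
X = np.array([[3., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 4.],
              [0., 1., 0., 2.]])
y = np.array([0, 0, 1, 1])              # category of each training document

# Equation (8): mean of the document vectors in each category.
centroids = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])

# Equation (9): weighted centroid g_n* = g_n (Hadamard) f.
f = np.ones(X.shape[1])                 # placeholder weights (uniform)
weighted_centroids = centroids * f
```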
Normally, training data should be located near the centroid of its own category, because a new document is classified by its distance from the centroids. Therefore, the optimization of the weight should maximize the similarity between each training document and its centroid.

From the above discussion, the optimal weighting vector $\boldsymbol{f}$ is estimated by maximizing the cosine similarity between each weighted document vector $\boldsymbol{d}_i^{*}$ and the weighted centroid vector $\boldsymbol{g}_n^{*}$ of its category. If the weighting vector is obtained by maximizing this similarity, the proposed weighting method can be expected to perform well in classifying data into the proper categories. Our optimal weighting vector is thus given by

$\hat{\boldsymbol{f}} = \arg\max_{\boldsymbol{f}} \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} \dfrac{\boldsymbol{d}_i^{*T} \boldsymbol{g}_n^{*}}{\|\boldsymbol{d}_i^{*}\| \, \|\boldsymbol{g}_n^{*}\|}$ (10)

Here, $\boldsymbol{f} = (f_1, f_2, \ldots, f_W)^{T}$ is the weighting vector, and it satisfies $\prod_{k=1}^{W} (f_k)^{2} = 1$.

Equation (10) can be expressed as a matrix operation by using the diagonal matrix whose elements are $(f_k)^{2}$; with this matrix, the optimal weight of each word is estimated by maximizing the cosine similarity between each training document $\boldsymbol{d}_i$ and its category's centroid $\boldsymbol{g}_n$. To this end, we introduce the diagonal metric matrix $M = [m_k]$ ($k = 1, 2, \ldots, W$) with diagonal elements $m_k = (f_k)^{2}$, that is,

$M = \mathrm{diag}\big((f_1)^{2}, (f_2)^{2}, \ldots, (f_W)^{2}\big)$

Here, $M$ satisfies $|M| = 1$, where $|M|$ is the determinant of $M$. From the above discussion, equation (10) can be rewritten as follows:

$\hat{M} = \arg\max_{M} \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} \dfrac{\boldsymbol{d}_i^{T} M \boldsymbol{g}_n}{\sqrt{\boldsymbol{d}_i^{T} M \boldsymbol{d}_i}\,\sqrt{\boldsymbol{g}_n^{T} M \boldsymbol{g}_n}}$ subject to $|M| = 1$ (11)
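The equivalence between the vector form (10) and the matrix form (11) follows because $(\boldsymbol{f} \odot \boldsymbol{d})^{T} (\boldsymbol{f} \odot \boldsymbol{g}) = \boldsymbol{d}^{T} M \boldsymbol{g}$ when $m_k = (f_k)^{2}$. A quick numerical check with random vectors (our own toy data):

```python
import numpy as np

rng = np.random.default_rng(1)
d = rng.random(5)
g = rng.random(5)
f = rng.random(5) + 0.1                 # strictly positive weights
M = np.diag(f ** 2)                     # diagonal metric with m_k = f_k^2

# Equation (10) form: cosine of the Hadamard-weighted vectors.
lhs = (f * d) @ (f * g) / (np.linalg.norm(f * d) * np.linalg.norm(f * g))
# Equation (11) form: cosine under the metric matrix M.
rhs = d @ M @ g / (np.sqrt(d @ M @ d) * np.sqrt(g @ M @ g))
assert np.isclose(lhs, rhs)
```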
From the above formulation, the following theorem is obtained.

Theorem 1: The metric matrix $\hat{M} = \mathrm{diag}(\hat{m}_1, \hat{m}_2, \ldots, \hat{m}_W)$ that satisfies equation (11) is given by

$\hat{m}_k = \dfrac{\left(\prod_{l=1}^{W} a_l\right)^{1/W}}{a_k}, \quad k = 1, 2, \ldots, W$ (12)

where $a_k = \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} v_{ik}\, g_{nk}$ is computed from the norm-normalized document and centroid vectors. (For the proof, see APPENDIX.)

From the above discussion, the similarity between a document vector $\boldsymbol{d}$ and a centroid $\boldsymbol{g}_n$ can be calculated by using the metric matrix $\hat{M}$ given by equation (12) as follows:

$\mathrm{sim}(\boldsymbol{d}, \boldsymbol{g}_n) = \dfrac{\boldsymbol{d}^{T} \hat{M} \boldsymbol{g}_n}{\sqrt{\boldsymbol{d}^{T} \hat{M} \boldsymbol{d}}\,\sqrt{\boldsymbol{g}_n^{T} \hat{M} \boldsymbol{g}_n}}$ (13)

After calculating the similarity between a test document and all centroids by equation (13), the test document is classified into the most similar category.
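The estimation and the classification rule can be summarized in a few lines of NumPy. This is a sketch under our own naming, with an added guard against zero coefficients (an assumption of ours, not part of the paper); it computes the $a_k$ of Theorem 1 from norm-normalized training vectors, forms $\hat{m}_k$ in log space for numerical stability, and classifies by equation (13). It assumes every document contains at least one valid word.

```python
import numpy as np

def learn_metric(X, y):
    """Estimate the diagonal metric of equation (12) from training data.

    X: (D, W) term-frequency matrix with rows d_i; y: (D,) category labels.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)        # normalize documents
    labels = np.unique(y)
    G = np.array([X[y == c].mean(axis=0) for c in labels])   # equation (8)
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)        # normalize centroids
    # a_k = sum over categories and their documents of v_ik * g_nk
    a = sum(Xn[y == c].sum(axis=0) * Gn[i] for i, c in enumerate(labels))
    a = np.maximum(a, 1e-12)          # guard against zero coefficients (assumption)
    # m_k = (prod_l a_l)^(1/W) / a_k, computed in log space for stability
    m = np.exp(np.mean(np.log(a)) - np.log(a))               # equation (12)
    return m, G, labels

def classify(d, m, G, labels):
    """Assign d to the centroid with the highest metric cosine, equation (13)."""
    num = (d * m) @ G.T
    den = np.sqrt(d @ (m * d)) * np.sqrt(np.einsum('nk,k,nk->n', G, m, G))
    return labels[np.argmax(num / den)]
```

For example, `m, G, labels = learn_metric(X_train, y_train)` followed by `classify(x, m, G, labels)` reproduces Steps 1-3 of the experimental procedure described in the next section.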
4. SIMULATION EXPERIMENTS

In this section, simulation experiments are conducted to verify the effectiveness of our method in practice using Japanese document data. The experiments are performed on two data sets, i.e., customer reviews posted on a web site and newspaper articles with pre-assigned categories. The suitable weight is estimated by learning from these data, and the effectiveness is confirmed by the accuracy with which test data are classified into the correct categories.

The basic process of the experiments is as follows.

Step 1: Calculate all centroids from the training data.
Step 2: Learn the metric matrix $\hat{M}$ from the training data by equation (12).
Step 3: Calculate the similarity between each test document and all category centroids by equation (13), and classify the document into the most similar category.

As mentioned above, two different types of data are used in the experiments. The first is articles from the Mainichi newspaper in 2005, which are used as benchmark data for document classification. The second is customer reviews posted on a web site, used to assess the performance of our method on real data. The Mainichi newspaper articles consist of several categories (economics, sports, politics, and so on); in these experiments, three and five categories are extracted at random. The customer reviews contain not only text data but also the degree of customer satisfaction as metadata, as mentioned above; in this experiment we use the two categories corresponding to the highest and lowest degrees of satisfaction.

The conditions of the experiments are shown in Table 1.
For comparison, experiments with the common cosine measure using only term frequency (tf measure) and with the tf-idf measure were also performed. The evaluation criterion is the classification accuracy rate, i.e., the ratio of the number of documents classified into the correct category to the total number of documents.

The results of the experiments are shown in Figures 1, 2, and 3. Figures 1 and 2 show the cases of the newspaper articles, the former with three categories and the latter with five. Figure 3 shows the case of the customer reviews. Each figure compares supervised weighting (the proposed method), tf-idf weighting, and term frequency only (tf weighting).

From Figures 1 and 2, the proposed method is basically superior to the tf-idf and tf weighting methods. However, from Figure 3, the proposed method and the other two methods show almost the same performance. That is, the proposed method is more effective when newspaper articles are used.

In the proposed method, the category information of the training data is taken into consideration as a weighting parameter, and this property works well for text classification, whereas tf-idf weighting does not use each category's characteristics. As a result, the performance on the newspaper data is improved. However, the performance of the proposed method, which can be regarded as an optimal solution, does not improve drastically in the customer review classification. Customers write their comments freely about the merchandise they bought and used, so many uninformative words appear and the tendencies of word usage hardly differ between the categories. The degree of freedom of such text is too high for the proposed method to estimate so complex a statistical structure, and for these reasons the proposed method does not work effectively there. Improving the method to obtain better performance on customer reviews (real data) is left as future work.
5. CONCLUSION

In this paper, a suitable weighting method that estimates the optimal weight from a training data set is proposed. Our proposal is based on supervised learning with an optimal metric matrix $\hat{M}$ that can be calculated from a training data set. The simulation experiments show that the proposed method is superior to the conventional methods when newspaper articles are used.

A future work is to calculate the contribution ratio of each word weighted by the proposed method for text classification, in order to further improve performance.
APPENDIX

We show the proof of Theorem 1 as follows. Solving for $M$ under the constraint $|M| = 1$ is equivalent to solving the following maximization:

$\hat{M} = \arg\max_{M} \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} \dfrac{\boldsymbol{d}_i^{T} M \boldsymbol{g}_n}{\sqrt{\boldsymbol{d}_i^{T} M \boldsymbol{d}_i}\,\sqrt{\boldsymbol{g}_n^{T} M \boldsymbol{g}_n}}$ (14)

In the following, to simplify the calculation, the documents and centroids are normalized by their norms; in other words, $\|\boldsymbol{d}_i^{*}\| = \|\boldsymbol{g}_n^{*}\| = 1$.1) So equation (14) becomes

$\hat{M} = \arg\max_{M} \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} \boldsymbol{d}_i^{T} M \boldsymbol{g}_n$ (15)

Here, $M$ is a diagonal matrix and $|M| = 1$. Therefore,

$\prod_{k=1}^{W} m_k = 1$ (16)

is derived. To maximize equation (15) under equation (16), the method of Lagrange multipliers is used. Let the Lagrange multiplier be $\lambda$; then the Lagrange function can be defined as

$L = \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} \boldsymbol{d}_i^{T} M \boldsymbol{g}_n - \lambda \left( \prod_{k=1}^{W} m_k - 1 \right)$ (17)

Partially differentiating $L$ with respect to $m_k$ and setting the derivatives to 0,

$\dfrac{\partial L}{\partial m_k} = \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} v_{ik}\, g_{nk} - \lambda \prod_{l \neq k} m_l = 0$ (18)

is derived. Solving equation (18) for $m_k$ by using $\prod_{l \neq k} m_l = 1 / m_k$ from equation (16),

$m_k = \dfrac{\lambda}{\sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} v_{ik}\, g_{nk}}$ (19)

is derived. Here, let $M^{-1}$ be

$M^{-1} = \mathrm{diag}\left( \dfrac{1}{m_1}, \dfrac{1}{m_2}, \ldots, \dfrac{1}{m_W} \right)$ (20)

Then, from equation (19),

$M^{-1} = \dfrac{1}{\lambda}\, \mathrm{diag}\left( \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} v_{i1}\, g_{n1},\; \ldots,\; \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} v_{iW}\, g_{nW} \right)$ (21)

Here, let $A = [a_{kl}]$ be the diagonal matrix with diagonal elements

$a_k = \sum_{n=1}^{N} \sum_{\boldsymbol{d}_i \in c_n} v_{ik}\, g_{nk}$ (22)

Then $A$ is

$A = \mathrm{diag}(a_1, a_2, \ldots, a_W)$ (23)

Therefore, from equations (21) and (23),

$A = \lambda M^{-1}$ (24)

is acquired. By the characteristics of $M$, $|M^{-1}| = 1$ is derived, so taking determinants of both sides of equation (24) gives $|A| = \lambda^{W}$. Therefore, $\lambda$ becomes

$\lambda = |A|^{1/W}$ (25)

Accordingly, from equations (19) and (25), $m_k$ becomes

$m_k = \dfrac{|A|^{1/W}}{a_k}$ (26)

Here, because $A$ is a diagonal matrix, $|A|$ is given by

$|A| = \prod_{l=1}^{W} a_l$ (27)

Consequently,

$|A|^{1/W} = \left( \prod_{l=1}^{W} a_l \right)^{1/W}$ (28)

is derived. Accordingly, from equations (26) and (28), each $m_k$ becomes

$\hat{m}_k = \dfrac{\left( \prod_{l=1}^{W} a_l \right)^{1/W}}{a_k}$

The proof is complete. □

1) In case $\|\boldsymbol{d}_i^{*}\| \neq 1$, $\|\boldsymbol{g}_n^{*}\| \neq 1$, the result can easily be extended.
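As a quick numerical sanity check (with synthetic $a_k$ values of our own), the closed form satisfies both the constraint $|M| = 1$ and the first-order conditions of equation (18):

```python
import numpy as np

rng = np.random.default_rng(0)
W = 6
a = rng.uniform(0.5, 2.0, W)                  # synthetic a_k of equation (22)

lam = np.prod(a) ** (1.0 / W)                 # lambda = |A|^(1/W), equation (25)
m = lam / a                                   # m_k = lambda / a_k, equation (19)

print(np.prod(m))                             # 1.0 up to rounding: |M| = 1 holds
for k in range(W):
    grad_k = a[k] - lam * np.prod(np.delete(m, k))   # equation (18)
    assert abs(grad_k) < 1e-9                 # stationarity holds at each k
```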
-
[Table 1.] Conditions of the Experiments.
-
[Figure 1.] Result for Classification Accuracy Using Newspaper (3 Categories).
-
[Figure 2.] Result for Classification Accuracy Using Newspaper (5 Categories).
-
[Figure 3.] Result for Classification Accuracy Using Customer Review.