Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category
- Author: Park So-Young, Chang Juno, Kihl Taesuk
- Organization: Park So-Young; Chang Juno; Kihl Taesuk
- Publish: Journal of Information and Communication Convergence Engineering, Volume 11, Issue 4, pp. 268-273, 31 Dec 2013
In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.
Document classification, Query generation, Text processing, Web documents
Recently, a huge number of documents have been produced and stored in digital archives. Therefore, document classification, the task of assigning a document to a predefined category, plays a very important role in many information management and retrieval tasks. Most digital documents are frequently updated, and writers disseminate information and present their ideas on various topics. Unlike news articles, which are written by well-educated journalists and standardized according to official style guides, most digital documents, such as weblogs and Twitter tweets, tend to contain colloquial sentences and slang that mislead the classifier, because anybody can publish anything [3-6]. To address these difficulties of document classification, researchers have proposed the following approaches.
First, naive Bayes-based approaches have performed well in spam filtering and email categorization, and they require only a small training corpus to estimate the parameters necessary for document classification. However, their conditional independence assumption is violated by real-world data, and they do not consider the frequency of word occurrences. Therefore, these approaches perform very poorly when features are highly correlated.
Second, support vector machine (SVM)-based approaches have been recognized as some of the most effective document classification methods among supervised machine learning algorithms. Nevertheless, these approaches have certain difficulties with respect to parameter tuning and kernel selection. Further, the performance of an SVM, which is suited to binary classification, degrades rapidly as the amount of training data decreases, resulting in relatively poor performance on large-scale datasets with many labels.
Third, knowledge-based approaches have utilized knowledge, such as the meanings of words and the relationships between them, which can be obtained from an ontology such as WordNet. However, because an ontology is inevitably incomplete, the effect of word sense and relation disambiguation is quite limited [9,10].
Fourth, Web-based approaches have been used for mining desired information, such as popular opinions, from Web documents obtained by traversing the Web in order to explore information related to a particular topic of interest. However, Web crawlers attempt to search the entire Web, which is impossible because of the size and complexity of the World Wide Web.
In this paper, we propose a document classification model that resolves the imbalance of the training corpus size per category by utilizing Web documents. The rest of this paper is organized as follows: Section II presents an overview of the proposed model, which consists of a prediction phase that assigns a category to a given document and a learning phase that balances the training corpus size by adding Web documents. Then, Section III presents some experimental results, and Section IV concludes the paper with the characteristics of the proposed method.
As shown in Fig. 1, the proposed Web-based document classification model is composed of a prediction phase and a learning phase. Given an unlabeled document in the prediction phase, the feature extraction step extracts some word features, such as unigrams and bigrams, from the document. Then, the document classification step predicts the category of the document by utilizing the learned parameters. The learning phase, in turn, obtains these learned parameters, such as the word-frequency distribution, from the category-labeled documents. Section II-A introduces the document classification method, and Section II-B describes the Web document acquisition method in detail.
The proposed document classification method selects the category $c_i$ with the highest matching probability for the given unlabeled document $D$, as represented in Eq. (1), where $c_i$ denotes an element of the category set $C$.

$$\hat{c} = \underset{c_i \in C}{\operatorname{argmax}}\; P(c_i \mid D) \tag{1}$$

In order to make it easy to deal with the document classification problem mathematically, the proposed model represents the unlabeled document $D$ as a document vector consisting of feature values extracted from the document by the feature extraction step. Each feature value indicates the number of times that a word feature, such as the bigram “arcade game,” appears in the document. Most elements of the document vector take a zero value.
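The feature extraction step can be illustrated with a minimal Python sketch. The function name `extract_features` and the tokenization (lowercasing and whitespace splitting) are assumptions made for illustration; the paper does not specify them. Only the features that actually occur are stored, which reflects the fact that most elements of the document vector are zero.

```python
from collections import Counter


def extract_features(text):
    """Count unigram and bigram word features in one document.

    Assumption: tokens are lowercased and split on whitespace; the
    tokenization actually used in the paper is not specified.
    """
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # Features that never occur are not stored; a Counter reports them as 0.
    return Counter(tokens + bigrams)


features = extract_features("Arcade game fans love this arcade game")
print(features["arcade game"])  # count of the bigram "arcade game"
print(features["puzzle"])       # a feature absent from the document
```

A sparse mapping of this kind is one common way to hold such a vector; a dense array indexed by a fixed vocabulary would carry the same information.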
In order to balance the training corpus size per category, the proposed Web-based document classification model increases the number of labeled documents by utilizing a Web search engine to retrieve the Web documents.
In the query generation step shown in Fig. 1, the proposed method first selects some useful word features per category on the basis of the chi-square statistic [16,17]. The chi-square statistic of the word feature $f_j$ in the category $c_i$ is defined as follows:

$$\chi^2(f_j, c_i) = \frac{N\,(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$$

where $A$ denotes the number of documents containing the word feature $f_j$ in the category $c_i$, $B$ represents the number of documents containing the word feature $f_j$ in categories other than $c_i$, $C$ indicates the number of documents not containing the word feature $f_j$ in the category $c_i$, $D$ refers to the number of documents not containing the word feature $f_j$ in categories other than $c_i$, and $N$ denotes the total number of documents. The statistic of each word feature $f_j$ is computed for every category, and the word features with the top $n$ highest chi-square statistics are used as the query candidates. Table 1 shows some examples of word features selected for categories such as health, music, shopping, social network service (SNS), and sports according to the chi-square statistic.
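The feature selection described above can be sketched in Python as follows. The sketch uses the standard chi-square statistic for a 2×2 contingency table built from the counts A, B, C, and D defined in the text. The function names, the toy documents, and the alphabetical tie-breaking are assumptions for illustration only.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic from the document counts A, B, C, and D."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the statistic is undefined, use 0
    return n * (a * d - c * b) ** 2 / denominator


def top_features(doc_features, doc_labels, category, n):
    """Return the n word features with the highest chi-square score.

    doc_features: one set of word features per document
    doc_labels:   the category label of each document
    """
    in_cat = sum(1 for label in doc_labels if label == category)
    out_cat = len(doc_labels) - in_cat
    vocabulary = set().union(*doc_features)
    scores = {}
    for feature in vocabulary:
        a = sum(1 for feats, label in zip(doc_features, doc_labels)
                if label == category and feature in feats)
        b = sum(1 for feats, label in zip(doc_features, doc_labels)
                if label != category and feature in feats)
        scores[feature] = chi_square(a, b, in_cat - a, out_cat - b)
    # Note: the statistic is symmetric, so a feature that is notably
    # absent from the category can also receive a high score.
    return sorted(scores, key=lambda f: (-scores[f], f))[:n]


# Hypothetical toy corpus, not taken from the paper.
docs = [{"workout", "diet"}, {"workout", "run"}, {"workout", "song"},
        {"song", "album"}, {"song", "play"}, {"game", "play"}]
labels = ["health", "health", "health", "music", "music", "game"]
print(top_features(docs, labels, "health", 1))
```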
Given the selected word features per category, the proposed Web document acquisition method generates queries for the Web search engine. Considering that the lower-ranked word features can be less related to the characteristics of the category, the proposed method combines two of the higher-ranked word features and the title of the category for the purpose of retrieving the Web documents closely related to the category. Therefore, each query consists of three terms: the category title, one of the top 5 ranked word features, and one of the word features ranked 6 to 200. For each category, the query generation step yields at most 975 queries by pairing each of the 5 higher-ranked word features with each of the 195 lower-ranked word features (5 × 195 = 975).
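The query generation step can be sketched as follows. The function name `generate_queries`, the placeholder feature names, and the order of the three terms inside the query string are assumptions; the text only states which three terms are combined.

```python
from itertools import product


def generate_queries(category_title, ranked_features, n_high=5, n_total=200):
    """Combine the category title with one top-ranked feature and one
    feature ranked below the top group.

    ranked_features must be sorted from the highest to the lowest score.
    """
    high = ranked_features[:n_high]
    low = ranked_features[n_high:n_total]
    return [f"{category_title} {h} {l}" for h, l in product(high, low)]


# Placeholder feature names standing in for the 200 selected features.
ranked = [f"feature{i}" for i in range(1, 201)]
queries = generate_queries("music", ranked)
print(len(queries))  # 5 * 195 = 975
print(queries[0])    # music feature1 feature6
```

When a category has fewer than 200 selected features, the same function simply produces fewer queries, which is consistent with 975 being an upper bound.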
Finally, the proposed Web document acquisition method sends each combined query to the open application programming interface (API) of the Web search engine, and receives the snippet results retrieved from the Web search engine. The proposed Web document acquisition method treats the snippet results of each query as one Web document.
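The acquisition step can be sketched as follows. Because the paper does not name the search engine or its API, the search call is passed in as a hypothetical callable `search_fn` that takes a query string and returns a list of snippet strings. `fake_search` is a stand-in used only to make the example runnable, and skipping queries without results is an assumption.

```python
def acquire_web_documents(queries, category, search_fn):
    """Turn the snippet results of each query into one labeled Web document.

    search_fn is a hypothetical callable (query -> list of snippet strings);
    it stands in for the open API of a Web search engine.
    """
    documents = []
    for query in queries:
        snippets = search_fn(query)
        if not snippets:
            continue  # assumption: queries without results add no document
        documents.append((" ".join(snippets), category))
    return documents


def fake_search(query):
    """Stand-in for a real search API, for illustration only."""
    return [f"snippet one about {query}", f"snippet two about {query}"]


web_docs = acquire_web_documents(["music song album"], "music", fake_search)
print(web_docs[0][1])  # category label attached to the Web document
```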
In order to verify the validity of utilizing the Web documents, we tested the MALLET document classification package with a mobile application description document corpus, which is divided into 90% for the training set and 10% for the test set. The document corpus consists of 3,521 documents and 302,772 words. Each document has 3 to 515 words, with an average of 85.99 words per document. As described in Fig. 2, the distribution of documents is not balanced among categories in the corpus. For instance, the documents in the Utility category account for roughly 21% of the corpus, while the documents in the Music category account for 0.28% of the corpus.
The proposed method is evaluated on the basis of the following criteria: precision, recall, and F-measure. Precision indicates the ratio of correctly predicted categories to all the categories predicted by the proposed document classification model. Recall indicates the ratio of correctly predicted categories to the total number of categories of the documents in the corpus. F-measure indicates the harmonic mean of the precision and the recall.
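The three criteria can be computed per category as in the following sketch. The function name `evaluate`, the toy labels, and the convention of returning 0.0 when a ratio is undefined (for example, when a category is never predicted) are assumptions rather than details given in the paper.

```python
def evaluate(gold, predicted, category):
    """Return (precision, recall, F-measure) for one category."""
    correct = sum(1 for g, p in zip(gold, predicted)
                  if g == category and p == category)
    predicted_count = sum(1 for p in predicted if p == category)
    gold_count = sum(1 for g in gold if g == category)
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = correct / predicted_count if predicted_count else 0.0
    recall = correct / gold_count if gold_count else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical labels, not taken from the paper's corpus.
gold = ["utility", "utility", "music", "sns"]
pred = ["utility", "utility", "utility", "sns"]
utility_scores = evaluate(gold, pred, "utility")
music_scores = evaluate(gold, pred, "music")
print(utility_scores)
print(music_scores)
```

In this toy example, the Music document is absorbed by the larger Utility category, so Music receives zero recall while the precision of Utility drops.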
Fig. 3 shows that the performance is improved by adding Web documents; here, each performance value is the average over 10-fold cross validation. In this figure, the y-axis indicates the recall value, and the x-axis denotes the number of Web documents. Because the proposed document classification model assigns a category to every given document, the overall precision is the same as the recall. The baseline performance without any document classification model is 21% because the documents in the Utility category account for roughly 21% of the corpus.
The Equal size performance indicates the recall of the MALLET document classification model learned from the training set with a different number of Web documents added per category in order to balance the training corpus size. The Equal addition performance indicates the recall of the MALLET document classification model learned from the training set with an equal number of Web documents added per category. For example, the Equal size method does not add any Web documents for categories such as Utility that already have many documents, while it adds many Web documents for categories such as Music that have few documents. On the other hand, the Equal addition method always adds the same number of Web documents to each category according to the x-axis value.
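The difference between the two settings can be sketched as follows. The paper does not give the exact rule by which the Equal size method decides how many Web documents each category receives, so the rule below (top up each category toward the size of the largest category, adding at most `x` documents) is purely an assumption, as are the function names and the toy category sizes.

```python
def equal_addition(counts, x):
    """Add the same number x of Web documents to every category."""
    return {category: x for category in counts}


def equal_size(counts, x):
    """Add Web documents only where a category is smaller than the largest.

    Assumption: each category is topped up toward the size of the largest
    category, with at most x Web documents added per category.
    """
    target = max(counts.values())
    return {category: min(x, target - n) for category, n in counts.items()}


# Hypothetical category sizes, not the paper's actual counts.
counts = {"utility": 740, "health": 300, "music": 10}
print(equal_addition(counts, 150))
print(equal_size(counts, 500))
```

Under this assumed rule, the largest category receives no Web documents at all, while the smallest receives the full allowance, which matches the Utility and Music examples in the text.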
Fig. 3 shows that the Equal size method is slightly more effective than the Equal addition method. By balancing the number of documents per category, the document classification model reduces the tendency of an ambiguous or less informative document to be assigned to a category with relatively more documents. Further, Fig. 3 shows that the performance does not always increase with the addition of Web documents because the characteristics of mobile application description documents are considerably different from those of Web documents, which are very sensitive to the Web search queries. Moreover, the overall performance does not increase much upon the addition of Web documents because most category prediction results in the test set are biased toward a few categories with many documents, such as the Utility category, the Puzzle/Board game category, and the Health/Fitness category, whereas Web documents improve the classification performance for the categories with few documents, such as the Music category, the Testurself category, and the SNS category.
Fig. 4 shows the effects of Web documents per category at the peak performance, which is obtained by adding 150 Web documents per category. It indicates that the effects of Web documents differ considerably from category to category, although the overall performance did not increase much upon the addition of Web documents. In particular, the recall and the precision of the Music category increased by 100% because the document classification model without Web documents did not predict the Music category at all, whereas the document classification model with Web documents predicted the Music category once, and correctly. For categories with fewer mobile application description documents, such as the Testurself category and the SNS category, the recall increased because the probability of predicting these categories increased upon the addition of Web documents corresponding to the categories. On the other hand, for categories with many documents, such as the Utility category and the Health/Fitness category, the precision increased because the probability of predicting these categories decreased. However, the recall of the Food category and the Education category decreased: because these categories had few documents, the proposed method did not find an appropriate query for Web document retrieval.
In this paper, we proposed a document classification model utilizing Web documents. The proposed model has the following characteristics. First, it can adjust the balance of the training corpus size per category by adding Web documents to the training corpus. Second, it can retrieve Web documents that are closely related to each category by combining two of the higher-ranked word features and the category title according to the matching score between the word features and the category. Experimental results show that the proposed model improves the performance in some categories with small training sets by balancing the number of documents per category. In the future, we intend to compare the effects of other feature selection methods, such as mutual information and information gain. Further, we plan to automatically generate word clusters and apply them to the proposed model.
[Fig. 1.] Proposed Web-based document classification model.
[Table 1.] Some word features selected on the basis of chi-square statistics
[Fig. 2.] Document distribution according to categories.
[Fig. 3.] Performance variation by the addition of Web documents.
[Fig. 4.] Performance difference between baseline and addition of Web documents.