Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

  • cc icon
  • ABSTRACT

    In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.


  • KEYWORD

    Document classification , Query generation , Text processing , Web documents

  • I. INTRODUCTION

    Recently, a huge number of documents have been produced and stored in digital archives [1]. Therefore, document classification plays a very important role in many information management and retrieval tasks. It refers to the task of classifying a document into a predefined category. Most digital documents are frequently updated, and writers disseminate information and present their ideas on various topics [2]. Unlike news articles written by well-educated journalists and standardized according to official style guides, most digital documents, such as weblogs and twitter tweets, tend to contain colloquial sentences and slang that misleads the classifier, because anybody can publish anything [3-6]. Considering this cumbersomeness of document classification, researchers have proposed the following approaches.

    First, naive Bayes-based approaches have performed well from the perspectives of spam filtering and email categorization, and they require a small training corpus to estimate the parameters necessary for document classification [4]. However, the conditional independence assumption is violated by real-world data and does not consider the frequency of word occurrences [7]. Therefore, these approaches perform very poorly when features are highly correlated [8].

    Second, the support vector machine (SVM)-based approaches have been recognized as some of the most effective document classification methods as compared to supervised machine learning algorithms. Nevertheless, these approaches have certain difficulties with respect to parameter tuning and kernel selection [7]. Further, the performance of SVM suitable for binary classification degrades rapidly, as the amount of training data decreases, resulting in a relatively poor performance on large-scale datasets with many labels [8].

    Third, knowledge-based approaches have utilized knowledge, such as the meaning and relationships of the words, which can be obtained from an ontology, such as WordNet [8]. However, considering the restricted incompleteness of an ontology, the effect of word sense and relation disambiguation is quite limited [9,10].

    Fourth, Web-based approaches have been used for mining desired information, such as popular opinions from Web documents obtained by traversing the Web in order to explore information that is related to a particular topic of interest. However, Web crawlers attempt to search the entire Web, which is impossible because of the size and the complexity of the World Wide Web [11].

    In this paper, we propose a document classification model to solve the imbalance of the training corpus size per category by utilizing Web documents. The rest of this paper is organized as follows: Section II presents an overview of the proposed model consisting of a prediction phase to assign a category to a given document, and a learning step to balance the training corpus size by adding Web documents. Then, Section III presents some experimental results, and the characteristics of the proposed method are presented as the conclusion of this paper in Section IV.

    II. WEB-BASED DOCUMENT CLASSIFICATION MODEL

    As shown in Fig. 1, the proposed Web-based document classification model is composed of a prediction phase and a learning phase. Given an unlabeled document in the prediction phase, the feature extraction step extracts some word features, such as unigrams and bigrams from the document. Then, the document classification step predicts the category of the document by utilizing the learned parameters. On the other hand, the learning phase obtains these learned parameters, such as word-frequency distribution from the category-labeled documents. Section II-A introduces the document classification method, and Section II-B describessed Web document acquisition method in detail.

      >  A. Document Classification Method

    The proposed document classification method selects the category ci, taking the highest matching probability for the given unlabeled document D, as represented in Eq. (1), where ci denotes an element of the category set C. In order to make it easy to mathematically deal with the document classification problem, the proposed model represents the unlabeled document D as the document vector

    image

    consisting of feature values extracted from the document by the feature extraction step. Each feature value indicates the number of times that each word feature, such as the bigram “arcade game” appears in the document [12]. Most elements of the document vector take a zero value.

    image

    In the maximum entropy framework [12-15], the conditional probability of predicting an outcome ci given the history d is defined as follows:

    image
    image

    where fi(d,ci) denotes the feature function, λi represents the weighting parameter of fi(d,ci), k refers to the number of features, and Z(d) indicates the normalization factor for Eq. (3) [13-15].

      >  B. Web Document Acquisition Method

    In order to balance the training corpus size per category, the proposed Web-based document classification model increases the number of labeled documents by utilizing a Web search engine to retrieve the Web documents.

    In the query generation step shown in Fig. 1, the proposed method first selects some useful word features per category on the basis of the chi-square statistics [16,17]. The chi-square statistics of word feature fj in the category ci is defined as follows:

    image

    where A denotes the number of documents containing the word feature fj in the category ci, B represents the number of documents containing the word feature fj in other categories rather than ci, C indicates the number of documents not containing the word feature fj in the category ci, D refers to the number of documents not containing the word feature fj in other categories rather than ci, and N denotes the total

    number of documents. Each word feature fj is computed for every category, and the word features with top n highest chisquare statistics are used for the query candidates. Table 1 shows some examples of word features selected for the categories, such as health, music, shopping, social network service (SNS), and sports according to the chi-square statistics.

    Given the selected word features per category, the proposed Web document acquisition method generates a query for the Web search engine. Considering that the lower-ranked word features can be less related to the characteristics of the category, the proposed method combines two of the higher-ranked word features and the title of the category for the purpose of retrieving the Web documents closely related to the category. Therefore, the query consists of three word features: the category title, one of the top 5 ranked word features, and one of the top 200 ranked word features except the top 5 ranked word features. For each category, the proposed query generation step maximally yields 975 queries by multiplying the 5 higherranked word features with the 195 lower-ranked word features.

    Finally, the proposed Web document acquisition method sends each combined query to the open application programming interface (API) of the Web search engine, and receives the snippet results retrieved from the Web search engine. The proposed Web document acquisition method assumes the snippet results of each query as one Web document.

    III. EXPERIMENTS

    In order to prove the validity of utilizing the Web documents, we have tested the MALLET document classification package [12] with a mobile application description document corpus [18], which is divided into 90% for the training set and 10% for the test set. The document corpus consists of 3,521 documents and 302,772 words. Each

    document has 3 to 515 words, and a document is composed of 85.99 words on average. As described in Fig. 2, the distribution of documents is not balanced among categories in the corpus. For instance, the documents corresponding to the Utility category capture roughly 21% of the corpus, while the documents corresponding to the music category capture 0.28% of the corpus.

    On the other hand, the proposed method is evaluated on the basis of the following evaluation criteria: precision, recall, and F-measure. Precision indicates the ratio of correct candidate categories to the candidate categories predicted by the proposed document classification model. Recall indicates the ratio of correct candidate categories to the total number of categories of the documents in the corpus. F-measure indicates the harmonic mean of the precision and the recall.

    Fig. 3 shows that the performance is improved by adding Web documents; here, each performance value is an average of the 10-fold cross validation. In this figure, the y-axis indicates the recall value, and x-axis denotes the number of Web documents. Because the proposed document classification model predicts all categories of the given documents, the precision is the same as the recall. The baseline performance without any document classification model is

    21% because the documents corresponding to the utility account for roughly 21% of the corpus.

    Equal size performance indicates the recall of the document classification model MALLET, learned from the training set with a different number of Web documents per category for the balance of the training corpus size. Equal addition performance indicates the recall of the document classification model MALLET, learned from the training set with an equal number of Web documents per category [19]. For example, the Equal size method does not add any Web documents for categories, such as Utility that already have many documents, while the method adds many Web documents for categories, such as Music that have few documents. On the other hand, the Equal addition method always adds the same number of Web documents to each category according to the x-axis value.

    Fig. 3 shows that the Equal size method is slightly more effective than the Equal addition method [19] because the document classification model decreases the tendency of the ambiguous or less informative document to correspond to the category with the relatively more documents, by balancing the number of documents per category. Further, Fig. 3 shows that the performance does not always increase according to the addition of Web documents because the characteristics of mobile application description documents is considerably different from the characteristics of Web documents, which are very sensitive to Web search queries. Moreover, the performance does not generally increase much upon the addition of Web documents as most category prediction results in the test set are biased towards a few categories corresponding to many documents, such as Utility category, Puzzle/Board game category, and Health/Fitness category, whereas Web documents improve the classification performance for the categories with a few documents, such as Music category, Testurself category, and SNS category.

    Fig. 4 shows the effects of Web documents according to categories at the peak performance by adding 150 Web documents per category. It indicates that the effects of Web documents can differ considerably from each other according to the category, although the general performance did not increase much upon the addition of Web documents. In particular, the recall and the precision of Music category increased by 100% because the document classification model without Web documents did not predict Music category at all, whereas the document classification model with Web documents attempted to predict the correct Music category once. Like the Testurself category and the SNS category, categories with fewer mobile application description documents increased the recall because the probability of predicting these categories increased by adding Web documents corresponding to the categories. On the other hand, categories, such as Utility category and the Health/Fitness category with many documents, increased the precision because the possibility of predicting these categories decreases. However, Food category and Education category decreased the recall because the proposed method did not find an appropriate query for Web document retrieval because these categories had few documents.

    IV. CONCLUSIONS

    In this paper, we propose a document classification model utilizing Web documents. The proposed model has the following characteristics: the proposed model can adjust the balance of the training corpus size per category by adding Web documents to the training corpus. Further, the proposed model can retrieve Web documents that are closely related to each category by combining two of the higher-ranked word features and the category title according to the matching score between the word features and the category. Experimental results show that the proposed model improves performance in some categories in the case of small training sets by balancing the number of documents per category. In the future, we intend to compare the effects of some feature selection methods, such as mutual information and information gain. Further, we plan to automatically generate word clusters and apply them to the proposed model.

  • 1. Nyberg K., Raiko T., Tiinanen T., Hyvonen E. 2010 “Document classification utilising ontologies and relations between documents” [in Proceeding of the 8th Workshop on Mining and Learning with Graphs] P.86-93 google
  • 2. Ayyasamy R. K., Tahayna B., Alhashmi S., Eu-Gene S., Egerton S. 2010 “Mining Wikipedia knowledge to improve document indexing and classification” [in Proceeding of 10th International Conference on Information Sciences, Signal Processing and their Applications] P.806-809 google
  • 3. Ferreira R., Freitas F., Brito P., Melo J., Lima R., Costa E. 2013 “RetriBlog: an architecture-centered framework for developing blog crawlers” [Expert Systems with Applications] Vol.40 P.1177-1195 google doi
  • 4. Park S., Kim C. W., An D. U. 2009 “E-mail classification and category reorganization using dynamic category hierarchy and PCA” [Journal of Information and Communication Engineering] Vol.7 P.351-355 google
  • 5. Yun H. 2011 “Classifying temporal topics with similar patterns on Twitter” [Journal of Information and Communication Engineering] Vol.9 P.295-300 google doi
  • 6. Yun H. 2012 “Quantifying influence in social networks and news media” [Journal of Information and Communication Convergence Engineering] Vol.10 P.135-140 google doi
  • 7. Baharudin B., Lee L. H., Khan K. 2010 “A review of machine learning algorithms for text-documents classification” [Journal of Advances in Information Technology] Vol.1 P.4-20 google doi
  • 8. Rubin T. N., Chambers A., Smyth P., Steyvers M. 2012 “Statistical topic models for multi-label document classification” [Machine Learning] Vol.88 P.157-208 google doi
  • 9. Lu G., Huang P., He L., Cu C., Li X. 2010 “A new semantic similarity measuring method based on Web search engines” [WSEAS Transactions on Computers] Vol.9 P.1-10 google
  • 10. Jialei Z., Hwang C. G., Jung G. D., Choi Y. K. 2011 “A design of K-XMDR search system using topic maps” [Journal of Information Communication Engineering] Vol.9 P.287-294 google doi
  • 11. Samarawickrama S., Jayaratne L. 2011 “Automatic text classification and focused crawling” [in Proceeding of 6th International Conference on Digital Information Management] P.143-148 google
  • 12. McCallum A. K. MALLET: a machine learning for language toolkit [Internet]. google
  • 13. Berger A. L., Della Pietra V. J., Della Pietra S. A. 1996 “A maximum entropy approach to natural language processing” [Computational Linguistics] Vol.22 P.39-71 google
  • 14. Lim J. H., Hwang Y. S., Park S. Y., Rim H. C. 2004 “Semantic role labeling using maximum entropy model” [in Proceeding of the Conference on Computational Natural Language Learning] P.122-125 google
  • 15. Tan H., Zhao T., Wang H., Hong W. P. 2007 “Identification of Chinese event types based on local feature selection and explicit positive & negative feature combination” [International Journal of the Korean Institute of Maritime Information and Communication Sciences] Vol.5 P.233-238 google
  • 16. Yang Y., Pedersen J. O. 1997 “A comparative study on feature selection in text categorization” [in Proceeding of the 14th International Conference on Machine Learning] P.412-420 google
  • 17. Seki K., Mostafa J. 2005 “An application of text categorization methods to gene ontology annotation” [in Proceeding of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval] P.138-145 google
  • 18. Kihl T., Chang J., Park S. Y. 2012 “Application tag system based on experience and pleasure for hedonic searches,” in Convergence and Hybrid Information Technology P.342-352 google
  • 19. Park S. Y., Chang J., Kihl T. 2013 “Application of Web search results for document classification,” in Future Information Communication Technology and Applications P.293-298 google
  • [Fig. 1.] Proposed Web-based document classification model.
    Proposed Web-based document classification model.
  • [Table 1.] Some word features selected on the basis of chi-square statistics
    Some word features selected on the basis of chi-square statistics
  • [Fig. 2.] Document distribution according to categories.
    Document distribution according to categories.
  • [Fig. 3.] Performance variation by the addition of Web documents.
    Performance variation by the addition of Web documents.
  • [Fig. 4.] Performance difference between baseline and addition of Web documents.
    Performance difference between baseline and addition of Web documents.