A Preliminary Study on the Multiple Mapping Structure of Classification Systems for Heterogeneous Databases
- Author: Lee Seok-Hyoung, Kim Hwan-Min, Choe Ho-Seop
- Organization: Lee Seok-Hyoung; Kim Hwan-Min; Choe Ho-Seop
- Publish: INTERNATIONAL JOURNAL OF KNOWLEDGE CONTENT DEVELOPMENT and TECHNOLOGY Volume 2, Issue 1, p51~65, 30 June 2012
As science and technology information service portals and heterogeneous databases produced in Korea and other countries are integrated, methods of connecting the unique classification systems applied to each database have been studied. The results of technologists’ research, such as journal articles, patent specifications, and research reports, are organically related to each other. If the most basic and meaningful classification systems are not connected, it is difficult to achieve interoperability of the information, and thus not easy to implement meaningful science and technology information services through information convergence. This study addresses this issue by analyzing mapping systems between classification systems in order to design a structure that connects the variety of classification systems used in the academic information database of the Korea Institute of Science and Technology Information, which provides a science and technology information portal service. This study also aims to design a mapping system for the classification systems to be applied to actual science and technology information services and information management systems.
Classification, Mapping Structure, Mapping System, Standard Classification
All information is allocated systematically to an appropriate location according to a given principle, concept, format, or topic, and a classification system is thus required for this purpose. The classification system is a tool for conceptualizing specific information, an instrument for representing and describing knowledge and enhancing access convenience, and performs important functions for systematic information management and services. While a recent mainstream use is web-based information services, a classification system is an essential element for the subject gateway, the directory service, and specialized search services. Furthermore, as both the semantic web and greater data processing technology develop, an emphasis is laid on the significance of classification systems as basic data for giving meaning to information (Yoon, 2006). A classification system is thus applied to and used with most information, and science and technology information also has a plurality of science and technical classifications.
Science and technical classifications are divided into academic classification, literature classification, technical classification, product classification, and industrial classification. The academic classification and the literature classification generally have a similar structure and contents because of their features. In terms of each feature, the academic classification is based on scientific principles, and the technical classification is based on technical areas that rest on at least one scientific principle. Likewise, the product classification and the industrial classification are based on their source technology and the economic effects of products (Song, 2002). Therefore, although a mapping structure between classifications for enhanced connectivity between science, technology, products, and industry is an important research topic, the difficulty of connecting the features of each classification has left few precedents for implementing and verifying such a connection structure. This study thus suggests a mapping system for classification systems to be applied to science and technology information services, by designing a connection structure for the plurality of science and technical classification systems used in the Korea Institute of Science and Technology Information (KISTI). Building on related studies of existing classification system connection structures, the suggested system applies relation types between classification classes within a multi-classification system connection structure, in order to support mapping and management of classification systems with a plurality of features.
Classification is carried out to combine similar things into one category and to divide different things. The general result of classification is to arrange concepts and objects in a systematic order (Ko, 2006; Jeong, 1997). A classification system aims to concentrate literature or knowledge of the same or similar subjects into one place, and to provide easy search access. It now covers classification of e-information sources as well.
Classification systems are classified into academic classification, literature classification, technical classification, industrial classification, product classification, encyclopedia classification, information service institution classification, research classification, subject classification, project classification, biological classification, and the like. There is no limit to classification types, especially because the literature classification matches the concept of logical classification and is on the same axis as that of the academic classification theory for academic activities.
The literature classification is for logically and systematically classifying data in libraries depending on their similarity, and is characterized by being more specific and practical in comparison with the academic classification because it is for classifying scientific results. Typical literature classification tables include the DDC (Dewey Decimal Classification), the KDC (Korea Decimal Classification), the UDC (Universal Decimal Classification), the LCC (Library of Congress Classification), and the CC (Colon Classification).
The academic classification, closely related to the library classification, is characterized by its abstract contents, of which the exemplary case is the academic classification system (with 18 categories) of the Korea Research Foundation.
In addition, the technical classification includes science and technology standard classification and industrial technology standard classification. The industrial classification includes the SIC, and the product classification includes the HS, the UNSPSC, and the military supply classification. It is necessary that the academic classification, the literature classification, the industrial classification, and the product classification are specially connected. Classifications can be connected to each other among technologies to which the academic principle is applied, products developed on the basis of those technologies, and the industry to generalize those ideas.
On the other hand, classification systems are divided into enumerative classification and analytic-synthetic (faceted) classification, depending on the format of subject arrangement. Some researchers argue that analytic-synthetic classification is different from faceted classification. A more detailed division distinguishes 6 types: enumerative classification, quasi-enumerative classification, quasi-faceted classification, fully fixed faceted classification, quasi-free faceted classification, and free faceted classification.
Enumerative classification predefines symbols about all subject categories, and is a classification system enumerating the symbols with a given rule by means of a top-down approach. The system shown as a symbol is practical, and enables an easy approach to the same category, and higher/lower categories. However, it is difficult to accept new subjects, and requires continuous amendment. The same concept may repeat or appear in different subjects or categories. An exemplary enumerative classification is the LCC.
Faceted classification is a classification system for synthesizing numbers in the final categories by means of predefined rules. This faceted classification analyzes a compound subject into facets in the step of concept and synthesizes them in the steps of language and symbolization, respectively (Jeong, 1997). Providing a variety of auxiliary tables, this faceted classification simplifies the classification system to implement flexible classification, but increases the complexity of tasks for giving classification. An exemplary faceted classification system is the UDC. The DDC and the KDC, exemplary literature classifications, are in the intermediate stage between the enumerative classification system and the faceted classification system.
In addition, classification systems defined according to arrangement structure include list type, tree type (taxonomy), arrangement type (facet), and network type (ontology).
This section relates to the application of classification systems to NDSL (National Discovery for Science Leaders) which is a science and technology information integration service of the KISTI and describes features thereof.
2.2.1. Dewey Decimal Classification (DDC)
The DDC, applied to overseas journal articles, is a literature classification that has been in use since its first edition was published by Dewey in 1876 and is currently used in more than 135 countries. It has 10 categories, each of which has 10 divisions, each of which has 10 sections, each of which has 10 subsections, forming a layered structure based on the decimal system (Yoon, 2006). In the NDSL, DDC codes are given to overseas journal articles and overseas conference proceedings, and the 22nd edition, published in September 2003, is currently applied. Because the overseas journal articles and conference proceedings are provided as science and technology documents, it is reasonable to apply the DDC, which is a literature classification. However, it is hard for users to identify the specific subject of an article because the DDC subject classification is applied collectively to a journal unit, not an article unit.
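The decimal layering described above can be illustrated with a short sketch. It assumes the common DDC notation of three digits optionally followed by a decimal point and further digits; the function name `ddc_levels` and the level labels are hypothetical and simply reuse the terms of this section.

```python
def ddc_levels(code: str) -> list[tuple[str, str]]:
    """List the broader DDC numbers that contain `code`, broadest first.

    Assumption: codes look like "005.75" (three digits, optional decimals).
    """
    main, _, fraction = code.partition(".")
    if len(main) != 3 or not main.isdigit() or (fraction and not fraction.isdigit()):
        raise ValueError(f"not a DDC-style number: {code!r}")
    levels = [
        ("category", main[0] + "00"),   # first digit selects 1 of 10 categories
        ("division", main[:2] + "0"),   # second digit selects 1 of 10 divisions
        ("section", main),              # third digit selects 1 of 10 sections
    ]
    # Every further decimal digit narrows the subject by one more layer.
    for i in range(1, len(fraction) + 1):
        levels.append(("subsection", f"{main}.{fraction[:i]}"))
    return levels


for level, number in ddc_levels("005.75"):
    print(level, number)
```

Because a DDC code is assigned per journal in the NDSL, such a chain describes the journal as a whole rather than any single article in it.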
2.2.2. KISTI Standard Classification <2002 edition>
The KISTI Standard Classification (2002 edition) applied to Korea’s journal articles and research reports is a previous version of the KISTI Standard Classification (2005 edition), and a classification system for which the JICST classification of Japan was reorganized in conformity with the situation in Korea. It has been used under the name of BIST Classification (science and technology literature classification) in the KISTI which was originally KINITI (Korea Institute of Industrial Technology Information), and is an exemplary subject classification in the field of science and technology which was established as KISTI Standard Classification in 2002 and used until 2005. This classification can belong to technical classification or subject classification, and may have overlapping classification classes. Because it has a predefined subject category for the field of science and technology and a category symbol for a specific subject is given to literature, this classification system belongs to the enumerative classification system. This classification system is used for some articles of Korea’s journal articles and research reports. However, for Korea’s journal articles, the DDC classification system is used for journal information, and the KISTI Standard Classification (2002 edition) for article information. Compatibility of classification information between two layers should be taken into consideration.
2.2.3. INSPEC Classification
The INSPEC Classification is a special-purpose classification for the data in the 5 specialty fields of the INSPEC database. It is similar to a subject classification and is an enumerative classification system in which the subject information is predefined. The INSPEC Classification has a 4-layer structure composed of 35 categories, 134 divisions, 1,117 sections and 2,206 subsections. The INSPEC database is a leading database of basic and applied research in advanced science and technology fields including physics, electricity, electronic engineering, computers and information technology, and it holds index and abstract information for about 3,500 mainstream journals and 1,500 meeting proceedings, as well as books, theses and reports. The subject category and category symbols are described in the meta information of the articles.
2.2.4. Science and Technical Standard Classification Codes, KISTI Standard Classification (2005 edition)
The KISTI Standard Classification (2005 edition) applied to the science and technology analysis and trend information is the Science and Technical Standard Classification Codes amended on the basis of the revision of the KISTI Standard Classification (2005 edition) included in the technical classification. The Science and Technical Standard Classification Codes were established by the KISTEP (Korea Institute of Science & Technology Evaluation and Planning) in order to standardize the classification systems used in specialized major research management institutions and to provide an efficient project management system. The classification has a 3-stage structure of 19 categories, 160 divisions and 1,203 sections. Because the subject category is predefined and category symbols are given to literature, this classification system belongs to the enumerative classification system. The section information is given to analysis and trend information.
2.2.5. International Patent Classification (IPC)
The IPC (International Patent Classification) classifies the contents of technology for the relevant patent applications, and includes a layer classification system for subdividing the technical field of inventions into sections, classes, main groups, and subgroups step by step. Because the IPC classification aims for a technical approach to inventions, it is technical classification, but the details are closely related to the subject classification and the industrial classification. The NDSL applies the IPC classification to Korean patents, United States patents, Japanese patents, European patents, and other international patents.
2.2.6. International Classification for Standards (ICS)
The ICS (International Classification for Standards) addresses the inconvenience that results from different countries using different classification systems for standards. It is a unified method of classifying standards, established by the ISO in 1994, and each country is recommended to use it for the standards data of each sector. An ICS code is a 7-digit numerical code composed of a category (2 digits), a division (3 digits), and a section (2 digits); it is used for producing the ISO catalogues of international standards and for classifying standards in databases. The NDSL provides industrial standard information with ICS codes assigned, which is used in integrated search across the KS, ISO, and IEC standards.
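As an illustration of the 2 + 3 + 2 digit structure, the following sketch splits an ICS code into its three levels. It assumes the dot-separated notation in which ICS codes are commonly written (for example `35.240.30`), which this section does not itself specify; the names `IcsCode` and `parse_ics` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class IcsCode(NamedTuple):
    category: str             # 2 digits
    division: Optional[str]   # 3 digits, absent for a category-only code
    section: Optional[str]    # 2 digits, absent above section level


# Assumed notation: "NN", "NN.NNN" or "NN.NNN.NN".
_ICS_PATTERN = re.compile(r"^(\d{2})(?:\.(\d{3})(?:\.(\d{2}))?)?$")


def parse_ics(code: str) -> IcsCode:
    """Split an ICS code into category, division and section."""
    match = _ICS_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not an ICS-style code: {code!r}")
    return IcsCode(*match.groups())


print(parse_ics("35.240.30"))
```

Keeping the missing levels as `None` lets a category-level code be compared with a full 7-digit code without padding either one.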
2.2.7. Miscellaneous classifications
The journals related to the international food sector and the overseas research reports in the NDSL use classification systems given by the information provider institutions. The overseas research reports use the project management classification system produced by the NTIS, which is a project management institution of the United States. The FoodSTA database, used for journals related to the international food sector, uses the FSTA classification system. Both classification systems are enumerative classifications with a layer structure.
Different classification systems are applied to different types of data and used in different configurations. In particular, although Korea’s journal articles and overseas journal articles have similar features, both the DDC and the KISTI Standard Classification are used because the information provision institutions are different. Accordingly, it is not easy to let users browse both types of articles through a single subject approach. Journal articles, patents, research reports, and trend and analysis information have a very clear context within the life cycle of research and development. That is, trend and analysis information makes it possible to know the latest research trends and the details of precedent research in advance, and patents and articles are produced in the course of research and technology development. Because the resulting research reports contribute to spreading research achievements and technology, providing information that connects these materials is very significant for supporting research and development. Therefore, efficient organization and connection services for such information are required. One of the most important tasks for this purpose is the inter-connection of science and technology literature by connecting classification systems.
Korean and overseas researchers have studied how to connect classification systems in order to manage and provide information with each different classification system.
Michael Day (2002), of the Renardus Consortium, tried to connect classification systems for the purpose of a subject gateway service among major libraries in 7 European countries. The Renardus Consortium was established for the project of providing a cross-search and reading service of information in the 6 major national libraries of Germany, Denmark, Finland, the Netherlands, Sweden, and the UK. The most characteristic feature of the Renardus service is that it provides an integrated subject gateway search service for the online academic information that belongs to each library. Mappings between classification systems are required among the member libraries of the Consortium for the subject gateway search. To this end, Renardus developed a mapping system (using the CARMEN tool) between the DDC classification system and the local classification (LC) system, which was used to build a 1:1 mapping table. When a user carries out a subject gateway search on the Renardus service web page, the mapping table between the DDC and the LC is used so that the classified information in the Internet information resources held by the libraries in each country can be read. The classification system connection structure of the Renardus system is characterized by unidirectional mapping from the DDC classification to the LC, which results from the nature of the subject gateway.
Hong (2004) made an analysis of equivalence/non-equivalence between classification systems which may occur in exchanging information between databases to which two classifications are applied, focusing on ‘Science and Technical Standard Classification Codes’ and ‘KISTI Standard Classification (2002)’. His analysis also focused on the classification codes in the field of chemistry. He defined 5 relation types between classification classes for interoperability among classification systems, and stated the issue of non-equivalence when the actual sample classes were tested.
Song (2006) analyzed the feasibility of building the NTIS by using a compatibility table that connects the Science and Technical Standard Classification Codes used in the NTIS with the classification systems of the research management institutions of each ministry. The analysis revealed that bidirectional interoperability between the NTIS and the classification systems of each research management institution is not very efficient, and that a new classification system needs to be considered.
In summary, many of the above studies approach the integration of classification systems by establishing a new classification system, and some have even suggested a newly designed classification system. However, none of the suggested classification systems has actually been applied yet. This is due to the difficulty of establishing a new standard classification system that accommodates all of the classification systems to be integrated, and to the time and expense needed to replace the existing classification systems with a new standard, even if one could be established.
While some overseas studies have suggested a method of connecting classification systems, and some studies in Korea have suggested methods of implementation, no study has yet presented a connection method that was actually implemented. A mapping table among classification systems is essential for connecting them, and a mapping system for creating the mapping table should be established. Connection by means of classification system mapping has the advantage that the existing classification systems need not be completely replaced by a new one, as they must be when classification systems are integrated; however, the mapping itself requires much time and expense. Considering the expense and efficiency of the two methods, this study assumes that it is better to ensure interconnection of literature by connecting a plurality of classification systems than by converting the existing classification systems, applied to about 50 million pieces of information in the NDSL, into a new classification model. The following section describes a method of creating a classification system mapping table and establishing a mapping system.
The most important step in designing the aforementioned classification system connection structure for connecting a plurality of classification systems is to define relations between classification classes which are the classification items for building the system. In this study, an analysis is made of the relation type of the classification systems and the classification classes suggested by Day (2002) and Hong (2004) to define the relativity between classes to be applied to 1:1 mapping.
Day (2002) defined the relation type between classification systems, on the assumption of establishing 5 relations: Fully Equivalent, Narrower Equivalent, Broader Equivalent, Major Overlap and Minor Overlap. Hong (2004) defined 5 semantic equivalent/non-equivalent types: Fully Equivalent, Narrower Equivalent, Broader Equivalent, Major and Minor Overlap and Non-equivalent between two classification classes. Comparing two relation types, Day (2002) divides the overlap type into Major Overlap and Minor Overlap depending on which classification between the DDC classification and his own classification includes more subject categories with respect to the overlap relation. However, Hong (2004) regards the overlap type as one single type. Day (2002) does not recognize the non-equivalent type, but Hong (2004) includes the non-equivalent type category. It seems that the types are slightly different depending on whether the object of the relation type is a classification system or a classification class.
This study gives a definition of 5 relation types of Fully Equivalent, Narrower Equivalent, Broader Equivalent, Major and Minor Overlap and Non-equivalent between classification classes as shown in Table 2 with reference to the above two relation types.
The 1:1 connection structure between classification systems has the structure shown in Table 1 and achieves mapping; it can be extended to a multi-classification system connection structure that automatically defines the relation type with another classification system. In this case, however, when a further classification system is added, it is mapped with one of the classification systems already stored. For example, if classification system A is mapped with classification system B, and classification system C is then added and mapped with classification system B, automatic mapping between A and C is implemented according to an extended relation type. Table 3 shows exemplary relation types derived by automatic extension between A and C when the relation type between A and B (mapping 1) is Fully Equivalent and the relation type between B and C (mapping 2) is Fully Equivalent, Narrower Equivalent, and Broader Equivalent, respectively.
In this case, if either mapping is Fully Equivalent, a relatively exact equivalence can be derived for every type of the other mapping except Major and Minor Overlap and Non-equivalent.
However, if the Narrower Equivalent type is combined with the Broader Equivalent type, as in the exemplary extension of Table 4, the classes of classification system A must be mapped or connected with those of classification system C manually or by other means. An independent process also seems necessary for classes with the Non-equivalent type or the Major and Minor Overlap type.
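The extension rules described above can be sketched as a small function. This is a minimal illustration, not the implementation of the study: the names `Relation` and `extend_relation` are hypothetical, and every combination that the text does not explicitly resolve (including Narrower with Narrower or Broader with Broader) is conservatively returned as `None`, meaning that manual mapping is needed.

```python
from enum import Enum
from typing import Optional


class Relation(Enum):
    FULLY_EQUIVALENT = "Fully Equivalent"
    NARROWER_EQUIVALENT = "Narrower Equivalent"
    BROADER_EQUIVALENT = "Broader Equivalent"
    OVERLAP = "Major and Minor Overlap"
    NON_EQUIVALENT = "Non-equivalent"


# Types that the text sets aside for an independent process.
_NEEDS_INDEPENDENT_PROCESS = {Relation.OVERLAP, Relation.NON_EQUIVALENT}


def extend_relation(a_to_b: Relation, b_to_c: Relation) -> Optional[Relation]:
    """Derive the A-C relation from mapping 1 (A-B) and mapping 2 (B-C).

    Returns None when the pair has to be mapped manually.
    """
    if a_to_b in _NEEDS_INDEPENDENT_PROCESS or b_to_c in _NEEDS_INDEPENDENT_PROCESS:
        return None
    # If either mapping is Fully Equivalent, the other one carries over.
    if a_to_b is Relation.FULLY_EQUIVALENT:
        return b_to_c
    if b_to_c is Relation.FULLY_EQUIVALENT:
        return a_to_b
    # Narrower combined with Broader, and any pairing the text does not
    # cover, is left for manual mapping (assumption for uncovered pairs).
    return None


print(extend_relation(Relation.FULLY_EQUIVALENT, Relation.NARROWER_EQUIVALENT))
print(extend_relation(Relation.NARROWER_EQUIVALENT, Relation.BROADER_EQUIVALENT))
```

Returning `None` rather than guessing keeps the automatic extension limited to the cases the relation types actually justify, so the remaining pairs can be routed to an expert.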
Now, the classification system mapping system is suggested in consideration of connectivity with the KISTI academic information database on the basis of the classification system connection structure designed above. The mapping system for the classification systems is classified into a classification system registration module, a classification system mapping module, and a classification system management module. The classification system mapping information is managed by the 5 tables shown in Figure 1.
Figure 1 illustrates a flow chart for a mapping system for the classification system used for mapping the DDC classification with the INSPEC Classification. When the classification system information is configured in a given format shown in (1) of Figure 1 to execute a loading program, it is stored in the classification table of (2). In (3), if mapping the INSPEC Classification with the classification class of the DDC classification system is carried out or modified, the relation type of the classification system mapping structure defined in Section 3 is stored in the relation type table of (2) for each class, and the classification table can be related with the mapping table by means of the relation type table. In addition, the log table is used for recording mapping history, and the rule table is for storing and managing relation type list defined in Section 3.
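One possible reading of the five tables can be sketched with an in-memory SQLite database. All table and column names below are hypothetical, since the text names the tables but not their fields, and the way the relation type table links a mapping to a rule is an assumption made only for illustration.

```python
import sqlite3

RELATION_NAMES = [
    "Fully Equivalent", "Narrower Equivalent", "Broader Equivalent",
    "Major and Minor Overlap", "Non-equivalent",
]

# Hypothetical schema: classification, rule, mapping, relation type, log.
SCHEMA = """
CREATE TABLE classification (
    id INTEGER PRIMARY KEY,
    system TEXT NOT NULL,
    code TEXT NOT NULL,
    UNIQUE (system, code)
);
CREATE TABLE relation_rule (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE mapping (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES classification(id),
    target_id INTEGER NOT NULL REFERENCES classification(id),
    UNIQUE (source_id, target_id)
);
CREATE TABLE relation_type (
    mapping_id INTEGER PRIMARY KEY REFERENCES mapping(id),
    rule_id INTEGER NOT NULL REFERENCES relation_rule(id)
);
CREATE TABLE mapping_log (
    id INTEGER PRIMARY KEY,
    mapping_id INTEGER NOT NULL REFERENCES mapping(id),
    event TEXT NOT NULL,
    logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_store() -> sqlite3.Connection:
    """Create the tables and load the relation type list into the rule table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO relation_rule (name) VALUES (?)",
                     [(name,) for name in RELATION_NAMES])
    return conn


def class_id(conn: sqlite3.Connection, system: str, code: str) -> int:
    """Return the id of a class, registering it on first use."""
    conn.execute("INSERT OR IGNORE INTO classification (system, code) VALUES (?, ?)",
                 (system, code))
    return conn.execute("SELECT id FROM classification WHERE system = ? AND code = ?",
                        (system, code)).fetchone()[0]


def record_mapping(conn, source, target, relation: str) -> int:
    """Store one class-to-class mapping, its relation type and a log entry."""
    rule = conn.execute("SELECT id FROM relation_rule WHERE name = ?",
                        (relation,)).fetchone()
    if rule is None:
        raise ValueError(f"unknown relation type: {relation!r}")
    cur = conn.execute("INSERT INTO mapping (source_id, target_id) VALUES (?, ?)",
                       (class_id(conn, *source), class_id(conn, *target)))
    mapping_id = cur.lastrowid
    conn.execute("INSERT INTO relation_type (mapping_id, rule_id) VALUES (?, ?)",
                 (mapping_id, rule[0]))
    conn.execute("INSERT INTO mapping_log (mapping_id, event) VALUES (?, 'created')",
                 (mapping_id,))
    return mapping_id


conn = open_store()
record_mapping(conn, ("INSPEC", "C6160"), ("DDC", "005.75"), "Major and Minor Overlap")
```

Keeping the relation type list in its own rule table means a relation type can be renamed or added without touching the stored mappings.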
The mapping system for classification systems supports interconnection of science and technology literature by creating mapping tables among the DDC, the KISTI Standard Classification, the INSPEC Classification, and the IPC, which are applied to the NDSL science and technology information. It provides an interface that allows experts in a specific field and classification experts to define relations between the classification categories of two classification systems and thereby create a mapping table. Figure 2 and Figure 3 show how two classification systems are mapped; the interface is composed of a classification system search, an output of the layers of each classification system, and a component for establishing classification system relations.
In the interface of Figure 3, the classification relations shown in Figure 4 can be defined through the component for establishing classification system relations. Figure 4 shows the mapping relations for the computer and database management system items of the KISTI Standard Classification (2005 edition), the INSPEC Classification, and the DDC. MAJ202 in the KISTI Standard Classification and C6160 in the INSPEC Classification are semantically fully equivalent in the sense of Table 1. In this case, MAJ202 is automatically in a Broader Equivalent relation with the sub-categories of C6160 because the structure of the KISTI Standard Classification differs from that of the INSPEC Classification. However, because the sub-categories of C6160 in the INSPEC Classification and those of 005.75 in the DDC are not fully equivalent, C6160 and 005.75 are in an overlap relation rather than a fully equivalent relation. Because of the semantic ambiguity of 005.752, 005.754, 005.755, 005.759 and Other in C6160Z, there is no semantic relation, but it is possible to establish a Broader Equivalent relation. The mapping table shown in Table 5 can be created by using the mapping system for classification systems to establish the mapping relations of Figure 4.
The mapping table in Table 5 should be designed to store relation information in both directions for the two classification systems, so that one classification system can be reached even when the other is used.
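A bi-directional mapping table of this kind can be sketched as follows. The sketch assumes that a relation is read from the source class to the target class, and that the reverse direction of Narrower Equivalent is Broader Equivalent (and vice versa); neither convention is stated in the text. The class name `MappingTable` is hypothetical, and pairing MAJ202 with C6160Z is only an illustrative choice of sub-category.

```python
from collections import defaultdict

# Assumed inverse of each relation type when the direction is reversed.
INVERSE = {
    "Fully Equivalent": "Fully Equivalent",
    "Narrower Equivalent": "Broader Equivalent",
    "Broader Equivalent": "Narrower Equivalent",
    "Major and Minor Overlap": "Major and Minor Overlap",
    "Non-equivalent": "Non-equivalent",
}


class MappingTable:
    """Stores every mapping in both directions; classes are (system, code) pairs."""

    def __init__(self) -> None:
        self._rows = defaultdict(dict)

    def add(self, source, target, relation: str) -> None:
        if relation not in INVERSE:
            raise ValueError(f"unknown relation type: {relation!r}")
        self._rows[source][target] = relation
        self._rows[target][source] = INVERSE[relation]

    def lookup(self, source, target_system: str):
        """Return (target class, relation) pairs for one target system."""
        return sorted(
            (target, relation)
            for target, relation in self._rows.get(source, {}).items()
            if target[0] == target_system
        )


table = MappingTable()
table.add(("KISTI", "MAJ202"), ("INSPEC", "C6160"), "Fully Equivalent")
table.add(("KISTI", "MAJ202"), ("INSPEC", "C6160Z"), "Broader Equivalent")
print(table.lookup(("INSPEC", "C6160Z"), "KISTI"))
```

Writing the inverse row at insertion time means a search that starts from either classification system needs only a single lookup.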
The mapping system designed above is specialized to the characteristics of the classification systems applied to the NDSL. Because most of these follow the enumerative classification, which defines and lists symbols for all subject categories according to a given rule, configuring the system is relatively straightforward. However, for a classification system whose structure is not simple and which allows symbol synthesis, such as the DDC, connection with other classification systems is very complex and may require much time and expense for mapping. The exceptions and considerations that may be encountered in implementing the mapping system are listed below.
How to establish a relation with subject categories if the two classification systems are of different types.
How to establish a relation if the two classification systems are of different configurations, such as the enumerative classification and the faceted classification.
How to handle the information of a created mapping table if it is updated, following amended classification systems.
How to automatically create a mapping table.
How to verify the relation established between two classification systems.
How to automatically extend a category relation with other classification systems which use an existing mapping table.
This study aimed to establish a mapping system for connecting a variety of classification systems applied by the NDSL in order to support interconnection of science and technology literature. To this end, features and problems of the classification systems used in the NDSL were first examined to analyze academic papers which suggest integration and connection of existing classification systems in order to define elements for developing a mapping system for classification systems. The classification systems applied to the NDSL vary, for example, literature classification, technical classification, subject classification, and standard classification, but there is no connection among the classification systems. Therefore, interoperability is not achieved for essential elements in the life cycle process of research and development including journal articles, patent, research reports, analysis, and trend information. This requires connection of classification systems in terms of provision of meaningful information services to users. To this end, this study suggested a mapping system for classification systems as the most effective method. It is necessary to further study a method of establishing mapping relations for supporting classification systems of different types, a method of automatically establishing relations between two categories, and a method of extending category relations that occur in extending classification systems.
[Table 1.] Defining relation in classification system mapping of Renardus (Day, 2002)
[Table 2.] Definition of relation types between classes
[Table 3.] Exemplary extension of relation type -1
[Table 4.] Exemplary extension of relation type -2
[Fig. 1.] Classification system mapping by using mapping system for classification systems
[Fig. 2.] Interface of Classification Mapping System
[Fig. 3.] Example of establishing classification system relation
[Fig. 4.] Example of Mapping Structure
[Table 5.] Creating mapping table by using a mapping system