Pre-Processing of Query Logs in Web Usage Mining

Abdullah Norhaiza Ya; Husin Husna Sarirah; Ramadhani Herny; Nadarajan Shanmuga Vivekanada

doi:10.7232/iems.2012.11.1.082

Open Access Journal
Industrial Engineering and Management Systems

Published in: Industrial Engineering and Management Systems, Volume 11, Issue 1, pp. 82-86, 01 March 2012

KEYWORD

Pre-Processing, Web Log, Web Usage Mining

1. INTRODUCTION

With the recent explosive growth of content on the Internet, it has become increasingly difficult for users to search for and use information. Content providers likewise find it difficult to classify and understand their users’ needs. Traditional web search engines often return hundreds or thousands of results for a single search, which are time consuming for users to browse. Thus, handling large volumes of data for many users can be slow and inefficient.

Basically, three types of information need to be handled in a web site: content, structure and log data (Batista et al., 2002; Dixit and Gadge, 2010; Nicholas et al., 2004). In this paper, we concentrate on web usage mining, which is also known as web log mining. Web usage mining of query logs can help organizations understand the patterns and profiles of their customers.

Web usage mining can be defined as the automatic discovery and analysis of user access patterns from web servers (Cooley et al., 1999). The process is divided into three phases: preprocessing, pattern discovery, and pattern analysis. In the preprocessing phase, the query logs are cleaned, and users and sessions are identified. Next, web usage mining techniques such as association and clustering are applied to uncover hidden patterns that reveal user behavior and profiles. In the final phase, pattern analysis, the discovered patterns are processed further into aggregate user models that serve as input to a recommender tool.

In this paper, we present the task of preprocessing query logs for web usage mining. For this project, we use query logs from an online newspaper company. The query logs undergo a preprocessing stage, in which the clickstream data is cleaned and partitioned into a set of user interactions that represent the activities of each user during their visits to the site. The rest of the paper is organized as follows. Section 2 reviews the literature on web usage mining and web logs. Section 3 describes the implementation of the preprocessing process, including the preprocessing algorithm. Section 4 presents the results, Section 5 acknowledges those who permitted us to use the web server logs for this study, and Section 6 summarizes the paper and outlines future work.

2. RELATED WORK

Recently, many researchers have taken an interest in web usage mining. Web mining is the process of extracting knowledge from artifacts and activity related to the World Wide Web (Cooley et al., 1999). Based on several studies, web usage mining can be used for different purposes such as personalization, system improvement and site modification (Kumari and Raju, 2010). Data preprocessing is a challenging and difficult phase (Cooley et al., 1999), and it is also the most important phase for investigating the usage behavior of web users. It requires extracting only the accesses made by human users from the web log data, which is a critical and complex task.

According to Cooley et al. (1999), servers monitor such logging information and maintain the details in special log files. These files, however, represent the information as raw textual data, which is very difficult for users to understand.

A web server log is an important source for web usage mining because it explicitly records the browsing behavior of site visitors. The data recorded in server logs reflects the access of a web site by multiple users (Markellou et al., 2005). The greatest advantage of web server logs is that they record what people have actually done, not what they might do or thought they did (Tyagi et al., 2010). The primary function of these logs is to record the operation of the web server, and they also support characterization, evaluation, reporting and website improvement (Mobasher et al., 2000; Nicholas et al., 2004). Web server logs are also commonly analyzed to report traffic patterns for advertising or customer analysis. All log files are generated using the common log file format that several WWW servers use (W3C, 1995). Basically, a log file contains information on each page request made to the web server. Figure 1 illustrates the format used in Berita Harian’s logs. Each log entry starts with the date and time, followed by the time zone of the web server, the IP address of the client, the HTTP request status, the cache size, the URL or page requested, the HTTP status code and the user agent.

[Figure 1.] Extract of Server Logs from Berita Harian.

3. DATA PREPARATION AND PREPROCESSING

3.1 Web Server

Basically, several preprocessing tasks need to be done before applying web mining algorithms to web server logs. There are five preprocessing tasks, as illustrated in Figure 2: data cleaning, user identification, session identification, path completion and transaction identification (Cooley et al., 1999). To prepare the web server log for the mining process, the data needs to be cleaned and preprocessed. Data cleaning is an important stage of data preprocessing, in which certain techniques are used to remove irrelevant and non-significant items from the web server logs. In this project, data cleaning consists of the following steps.

[Figure 2.] Preprocessing Process (Cooley et al., 1999)

Step 1: Format the data. The data is retrieved from an Nginx web server and does not follow the conventional Common Log File (CLF) or Extended Log File (ELF) format.

Step 2: Remove requests for image and style files such as .jpg, .gif and .css, as well as all folders that contain images.

Step 3: Remove entries with an HTTP status code other than 200. Status code 200 denotes that the request was successful. Other HTTP status codes found are 302 (Found), 304 (Not Modified) and 404 (Not Found).

Step 4: Remove entries with a request method other than GET and POST. The HEAD request method is considered irrelevant because it returns only headers, without content (Nicholas et al., 2004). Other request methods such as PUT, DELETE, TRACE and CONNECT may indicate bad requests, probes of server properties or visits by robots.
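Cleaning steps 2 to 4 can be sketched as a simple filter. This is a minimal illustration in Python 3 rather than the Python 2.6 script used in the study; the dictionary keys (method, path, status) and the image folder names are assumptions made for the example, not details taken from the Berita Harian logs.

```python
import re

# Hypothetical record layout: each parsed log entry is a dict with
# "method", "path" and "status" keys (these names are assumptions).
STATIC_FILE = re.compile(r"\.(jpe?g|gif|css)(\?.*)?$", re.IGNORECASE)
IMAGE_FOLDER = re.compile(r"/(images?|gallery)/", re.IGNORECASE)  # assumed folder names
ALLOWED_METHODS = {"GET", "POST"}


def is_relevant(entry):
    """Return True if the entry survives cleaning steps 2 to 4."""
    path = entry["path"]
    if STATIC_FILE.search(path) or IMAGE_FOLDER.search(path):
        return False  # Step 2: image or style file, or image folder
    if entry["status"] != 200:
        return False  # Step 3: keep successful requests only
    if entry["method"] not in ALLOWED_METHODS:
        return False  # Step 4: drop HEAD, PUT, DELETE, TRACE, CONNECT
    return True


def clean(entries):
    """Keep only the entries that pass all three cleaning rules."""
    return [entry for entry in entries if is_relevant(entry)]


sample = [
    {"method": "GET", "path": "/news/article-1.html", "status": 200},
    {"method": "GET", "path": "/images/logo.gif", "status": 200},
    {"method": "GET", "path": "/style/main.css", "status": 200},
    {"method": "GET", "path": "/news/old.html", "status": 404},
    {"method": "HEAD", "path": "/news/article-1.html", "status": 200},
    {"method": "POST", "path": "/search", "status": 200},
]
cleaned = clean(sample)
print([entry["path"] for entry in cleaned])
```

Of the six sample entries, only the successful GET for the article page and the POST to /search remain after cleaning.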

After the data has gone through extensive data cleaning, the next step is to identify users. A user can be defined as someone trying to access web pages from the web server. In this paper, the following rules are observed (Dixit and Gadge, 2010).

A new IP address indicates a new user.

If the IP address is the same but the log files show a different user agent, the entry represents a new user.
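The two rules above amount to treating each distinct pair of IP address and user agent as one user. The sketch below illustrates this in Python 3; the field names ip and user_agent and the sample addresses are assumptions for the example only.

```python
def identify_users(entries):
    """Label each entry with a user id, one id per (IP, user agent) pair.

    Ids are assigned in order of first appearance in the log.
    """
    user_ids = {}
    labelled = []
    for entry in entries:
        key = (entry["ip"], entry["user_agent"])
        if key not in user_ids:
            # New IP, or known IP with a new user agent: new user.
            user_ids[key] = len(user_ids) + 1
        labelled.append(dict(entry, user_id=user_ids[key]))
    return labelled, len(user_ids)


# Illustrative entries using documentation-range IP addresses.
sample = [
    {"ip": "192.0.2.1", "user_agent": "Firefox", "path": "/a.html"},
    {"ip": "192.0.2.1", "user_agent": "Chrome", "path": "/b.html"},
    {"ip": "192.0.2.2", "user_agent": "Firefox", "path": "/a.html"},
    {"ip": "192.0.2.1", "user_agent": "Firefox", "path": "/c.html"},
]
labelled, user_count = identify_users(sample)
print(user_count, [entry["user_id"] for entry in labelled])
```

Here the first and last entries share both IP address and user agent, so they are attributed to the same user, while the second entry counts as a different user despite coming from the same IP address.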

3.2 Cache Server

Berita Harian uses a cache server to expedite the servicing of client requests. This is possible because the cache server keeps local copies of frequently requested resources. If a user requests the same data again, the cache resends the same answer without contacting the web server. The goal of caching is to eliminate the need to send requests in many cases, and to eliminate the need to send full responses in many other cases. In the web server logs, the cache status is indicated by HIT, MISS, EXPIRED, UPDATING or STALE. HIT means that the requested page is available in the cache; MISS means that the requested page is not available in the cache and has to be read from the web server; EXPIRED means that the cached copy has exceeded its cache age. The use of a cache server may cause underreporting of site traffic, loss of referring site information and difficulty in identifying site usage. Proxy-level caching can also cause the response to a single request to be viewed by multiple users over an extended period of time. Consequently, user session identification is difficult, because it is an arduous task to determine when a user’s session is actually over (Srivastava et al., 2000).

3.3 Preprocessing Algorithm

The following is an extract of the preprocessing algorithm, implemented in Python 2.6. The web server logs were given to us in .tar format. The first step was to compile the web server logs according to a common format, since most log files have their own format characteristics. For these web server logs, we standardized the format as date, time, time zone, IP address, cache status and cache control. Once the format is ready, we search for the HTTP request fields defined by the Nginx HTTP log, which are method, path, protocol, status and browser. Then comes the first stage of data cleaning: removing unnecessary image files. Here, we used regular expressions to remove all image files from the page requests. Sometimes the images are stored in folders, because Berita Harian maintains image galleries for special content such as elections and the World Cup, as well as images for button ads and many more. Once the images have been removed, the .tar file is parsed and its entries are put into a new database. The last step is to display the results in graphs. The results are broken down by status code, cache status, HTTP method, browser and operating system of the users.

Step 1: Compile the log file according to the desired format, which is date, time, time zone, IP address, cache status and cache control.

Step 2: Search for the HTTP request fields defined by the Nginx HTTP log, which are method, path, protocol, status and browser.

Step 3: Define the regular expressions that remove all images from the page requests.

Step 4: Read the log files. The log files are in .tar format.

Step 5: Put each log file entry into a database named totallog.

Step 6: Count each HTTP status code (200, 302, 304, 404), each cache status (MISS, HIT, EXPIRED), each HTTP method (GET, POST, HEAD), each browser (Internet Explorer, Firefox, Mozilla, Safari, Chrome, others) and each operating system (Windows, Linux), and display the counts in charts.
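The steps above can be sketched end to end as follows. This is not the authors' script: it is a minimal Python 3 illustration in which the log line layout, the choice of SQLite for the totallog database, and the helper names load_tar, count_by and make_demo_tar are all assumptions. The real Berita Harian line format is not reproduced in the text, so the regular expression here matches an invented layout.

```python
import io
import re
import sqlite3
import tarfile

# ASSUMED line layout, for illustration only:
# date time timezone ip cache_status "METHOD path PROTOCOL" status "user agent"
LINE = re.compile(
    r'(?P<date>\S+) (?P<time>\S+) (?P<tz>\S+) (?P<ip>\S+) (?P<cache>\S+) '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" (?P<status>\d{3}) '
    r'"(?P<agent>[^"]*)"'
)
IMAGE = re.compile(r"\.(jpe?g|gif|css)$", re.IGNORECASE)  # Step 3
COUNTABLE = {"status", "cache", "method"}


def load_tar(fileobj, conn):
    """Steps 4 and 5: read a .tar archive, store non-image entries in totallog."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS totallog (date TEXT, time TEXT, tz TEXT, "
        "ip TEXT, cache TEXT, method TEXT, path TEXT, protocol TEXT, "
        "status INTEGER, agent TEXT)"
    )
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            for raw in archive.extractfile(member):
                match = LINE.match(raw.decode("utf-8", "replace").strip())
                if match is None or IMAGE.search(match.group("path")):
                    continue  # skip malformed lines and image requests
                row = match.groupdict()
                row["status"] = int(row["status"])
                conn.execute(
                    "INSERT INTO totallog VALUES (:date, :time, :tz, :ip, "
                    ":cache, :method, :path, :protocol, :status, :agent)",
                    row,
                )


def count_by(conn, column):
    """Step 6: count entries per distinct value of one column."""
    if column not in COUNTABLE:
        raise ValueError("unsupported column: " + column)
    query = "SELECT {0}, COUNT(*) FROM totallog GROUP BY {0}".format(column)
    return dict(conn.execute(query).fetchall())


def make_demo_tar(lines):
    """Build a small in-memory .tar archive holding one log file."""
    data = "\n".join(lines).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name="access.log")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


demo_lines = [
    '2011-05-01 10:00:01 +0800 192.0.2.1 HIT "GET /news/a.html HTTP/1.1" 200 "Mozilla/5.0 Firefox/4.0"',
    '2011-05-01 10:00:02 +0800 192.0.2.1 MISS "GET /img/b.gif HTTP/1.1" 200 "Mozilla/5.0 Firefox/4.0"',
    '2011-05-01 10:00:03 +0800 192.0.2.2 MISS "GET /news/c.html HTTP/1.1" 404 "Opera/9.80"',
]
conn = sqlite3.connect(":memory:")
load_tar(make_demo_tar(demo_lines), conn)
status_counts = count_by(conn, "status")
cache_counts = count_by(conn, "cache")
method_counts = count_by(conn, "method")
print(status_counts, cache_counts, method_counts)
```

In the demo, the .gif request is dropped before insertion, so two entries reach totallog. The resulting dictionaries could then be passed to any charting tool; chart drawing is omitted from this sketch.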

4. RESULTS

An experiment using web server logs was conducted to test our algorithm. For this experiment, we used 750 MB of data, which amounts to 401809 log entries. The first step in our data cleaning stage is to remove all images, which include .gif, .jpeg and .css files. Due to the format of the log file, some of the images are hidden in folders. Therefore, the log file has to be examined carefully to find image folders as well as image files. Table 1 shows that the number of log entries decreased considerably after all the images were removed, from 401809 to only 44014, which is approximately 11% of the original data.
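The reduction reported above can be checked with a short calculation. The two counts come from the text; the variable names are illustrative, and the snippet uses Python 3 division.

```python
total_entries = 401809      # log entries before cleaning (from the text)
remaining_entries = 44014   # log entries after removing images (from the text)

retained = remaining_entries / total_entries
removed = 1 - retained
print("retained: {:.1%}, removed: {:.1%}".format(retained, removed))
# prints: retained: 11.0%, removed: 89.0%
```

In other words, roughly nine out of ten logged requests in this data set were for images and similar embedded files.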

[Table 1.] Results of Log Files After Removing Image Files.


After the images are removed, the next step is to filter by status code. Figure 3 shows the different status codes that were identified; only entries with status code 200 are kept.

[Figure 3.] Web Server Logs Based on HTTP Status Code.

Figure 4 shows the results of user identification. After the IP address of each user is identified, the users are further divided by user agent. This follows the rule that if the IP address is the same but the user agent is different, the entries denote different users. From the graph, the most common user agent is Internet Explorer, followed by Firefox, Chrome, Opera and Safari. Other user agents include browsers on wireless devices such as smartphones, the iPhone or the BlackBerry. Figure 5 illustrates user identification based on the list of IP addresses, the browsers used from each IP address, and the pages requested. In this figure, users are identified by their IP address. Where several entries share the same IP address but the page is accessed from different browsers, they are treated as different users.
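Grouping user agents into browser families, as in Figure 4, can be approximated by substring matching on the user agent string. The sketch below is an assumption about how such a grouping might be done, not the method used in the study; the token list and the sample strings are illustrative. It is deliberately simplistic: for instance, a mobile Safari user agent would be counted as Safari here rather than as a wireless device.

```python
from collections import Counter

# Order matters: Chrome user agents also contain "Safari", and most
# browsers start with "Mozilla/", so more specific tokens come first.
BROWSER_TOKENS = [
    ("Opera", "Opera"),
    ("MSIE", "Internet Explorer"),
    ("Firefox", "Firefox"),
    ("Chrome", "Chrome"),
    ("Safari", "Safari"),
]


def classify_browser(user_agent):
    """Map a raw user agent string to a browser family name."""
    for token, name in BROWSER_TOKENS:
        if token in user_agent:
            return name
    return "Other"


sample_agents = [
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.30 (KHTML, like Gecko) "
    "Chrome/12.0.742.112 Safari/534.30",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.10",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/533.21.1 "
    "(KHTML, like Gecko) Version/5.0.5 Safari/533.21.1",
    "BlackBerry9700/5.0.0.862 Profile/MIDP-2.1",
]
browser_counts = Counter(classify_browser(agent) for agent in sample_agents)
print(browser_counts)
```

Each of the six sample strings falls into a different family, with the BlackBerry string landing in "Other".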

[Figure 4.] User Identification Based on Browser.

[Figure 5.] Results of IP Address, Browser and Page Request.

5. CONCLUSION

In this paper, we presented the details of our preprocessing phase, which is used to clean web server logs. In our Python 2.6 script, we define regular expressions and provide rules for every item that needs to be cleaned. The experiment successfully cleaned the web server logs of unnecessary and non-significant information. Testing of the script shows the importance of the preprocessing phase: it not only reduces the log file size but also increases the quality of the data available for the later pattern discovery phase of web usage mining. However, some issues remain to be resolved, such as session identification and transaction identification. Future work will identify an appropriate way to sessionize the data, given that a cache server is used to serve the most recent page requests by clients.

REFERENCES

1. Batista P., Silva M. J., Silva M., Grande C. (2002) Mining On-line Newspaper Web Access Logs [Proceedings of the AH’2002 Workshop on Recommendation and Personalization in eCommerce] P.100-108
2. Cho Y. H., Kim J. K., Kim S. H. (2002) A personalized recommender system based on web usage mining and decision tree induction [Expert Systems with Applications] Vol.23 P.329-342
3. Cooley R., Mobasher B., Srivastava J. (1999) Data Preparation for Mining World Wide Web Browsing Patterns [Knowledge and Information Systems] Vol.1 P.5-32
4. Dixit D., Gadge J. (2010) Automatic Recommendation for Online Users Using Web Usage Mining [International Journal of Managing Information Technology (IJMIT)] Vol.2 P.33-42
5. Elsheikh S. (2008) Web Usage Data for Web Access Control (WUDWAC) [Proceedings of the World Congress on Engineering]
6. Hao T., Brimmer D. J., Lin J. M. S., Tumpey A. J., Reeves W. C. (2009) Web Usage Data as a Means of Evaluating Public Health Messaging and Outreach [Journal of Medical Internet Research] Vol.11 P.99-118
7. Vellingiri J. S., And Pandian C. (2011) A Survey on Web Usage Mining [Global Journal Of Computer Science and Technology] Vol.1 P.4343-4350
8. Kumari V. V., Raju K. S. (2010) Understanding User Behavior using Web Usage Mining [International Journal of Computer Applications] Vol.7 P.162-286
9. Markellou P., Rigou M., Sirmakessis S. (2005) Mining for Web Personalization, in Scime, A. (Ed.) Web Mining: Applications and Techniques P.27-48
10. Mobasher B., Dai H., Luo T., Sun Y., Zhu J. (2000) Integrating web usage and content mining for more effective personalization [Proceedings of the First International Conference on Electronic Commerce and Web Technologies, LNCS, 1875] P.165-176
11. Murgue T., Jaillon P. (2005) Data Preparation and Structural Models for Web Usage Mining [SETIT International Conference: Sciences of Electronic, Technologies of Information and Telecommunication.]
12. Nicholas D., Huntington P., Williams P., Dobrowolski T. (2004) Reappraising information seeking behavior in a digital environment [Journal of Documentation] Vol.60 P.24-43
13. Pitkow J. (1997) In search of reliable usage data on the WWW [Sixth International World Wide Web Conference] P.451-463
14. Srivastava J., Cooley R., Deshpande M., Tan P. N. (2000) Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data [ACM SIGKDD] Vol.1 P.12-23
15. Sanjay B., Thakare S. (2010) A effective and complete preprocessing for Web Usage Mining [IJCSE International Journal on Computer Science and Engineering] Vol.2 P.848-851
16. (2011) Status codes
17. Tanasa D., Trousse B. (2004) Advanced Data Preprocessing for Intersites Web Usage Mining [IEEE Intelligent Systems] Vol.19 P.59-65
18. Tyagi N. K., Solanki A. K., Wadhwa M. (2010) Analysis of Server Log by Web Usage Mining for Website Improvement [International Journal of Computer Science Issues] Vol.7 P.17-21