The following paper discusses web mining applications and their benefits to e-commerce. The paper begins with an introduction on how web mining has brought success to St. John Health System, and then defines the three categories of web mining (web content mining, web structure mining and web usage mining). The paper then focuses mainly on how web content and web usage mining are helpful to e-businesses. The four major steps in web usage mining (data collection, preprocessing of data, pattern discovery and pattern analysis) are explained. Sources of web usage data such as browsing histories, clickstream data and web logs are described, along with how the data from these sources helps the e-business. Examples are given of clustering and classification and of how these web usage mining algorithms benefit targeted advertising and personalization. The paper also describes several web mining technologies, including WEBMINER, ClickTracks and Google Touchgraph. Lastly, the paper discusses the challenges and limitations of using web mining for e-commerce.
Introduction
St. John Health System is a health care system with 8 hospitals and 125 medical locations. St. John's website "tracks satisfaction among transactions, such [as] online registration for health assessments and scheduling of physician visits, all in an effort to determine how many new patients the website is responsible for driving into the health system" [20]. By doing this, "St. John has seen a 15 percent increase in new patients" [20] and a return on investment of 400 percent "despite a highly competitive health market and a declining consumer population" [20]. St. John's story is one of many that show how web mining systems have been used to drive organizations forward [20].
If there were a mathematical function to describe the growth of the web, that function would be exponential. This explosive expansion has led many organizations to put their information on the web to keep pace with the shift in where people obtain their information [9]. However, as the web grows in size, so does the difficulty of managing its information. This gives rise to the need for both client- and server-side systems that can intelligently mine for knowledge. Web mining is a technique that seeks to achieve this by extracting useful information from web pages [9]. It is used to gather customers' online activity to help businesses by providing high-level knowledge in the form of rules and patterns that describe consumer navigational and purchasing behavior [11]. Web mining is divided into three main categories: web content mining, web structure mining, and web usage mining.
Web Content Mining
Web content mining uses data mining techniques to examine content published on the web. This content may come from sources such as emails, HTML and XML documents, or photo objects. Web structure mining analyzes the hyperlink structure contained within these documents. Web usage mining analyzes users' interactions with the web server through artifacts such as web logs, clickstreams and database transactions. It aims to determine how users behave and interact with the Web [5].
Web content mining can be used for searching and filtering information, categorizing documents, finding similar web pages on different servers and identifying themes in different web documents [2]. E-businesses use web content mining to collect information [4]. It helps them predict what customers want and gain a better understanding of the market's needs in order to develop optimization strategies. The business can make better decisions about what products should be purchased and what kinds of services should be provided [8], which makes the business more competitive. Additionally, this can help the business reduce operating costs by carrying out more precisely targeted market activities [12].
Web Usage Mining
Web usage mining aims to automatically discover and analyze usage patterns. Knowledge obtained from usage patterns can be directly applied to efficiently manage activities related to e-businesses and e-services [1]. Applications of web usage mining include customer profiling, personalization of online services, and product and content recommendations. Accurate web usage information can also help to improve cross marketing and sales, the effectiveness of promotional campaigns, and the tracking of customers who are about to leave [1].
Web usage mining can be applied to log files containing information on user navigation. This is done so that hyperlinks similar to the ones previously clicked by the user can be recommended to him or her [8]. Learning online users' surfing patterns from their browsing habits can help e-businesses provide personalization to their customers. To give an overview of how this is accomplished: when users surf websites, the web server automatically gathers and saves the users' visiting information in a web log file. A generated web page can then be tailored to a specific user based on that user's information in the log file. Providing a customized web page to each online user makes the user feel as if the website was specifically designed for him or her. This can improve customer satisfaction and increase sales because the customer is presented with products that they are potentially interested in buying [8]. This is an important technique, as it saves the user time and makes the website more convenient and easy to use, which in turn increases the chances of the user returning to the site and recommending it to others.
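To make this concrete, here is a minimal Python sketch (our own illustration, not a method from reference [8]) of the first step: extracting (visitor, page) pairs from a server log, assuming the widely used Common Log Format. A personalization engine would build its recommendations on top of such per-user histories.

import re
from collections import defaultdict

# Minimal Common Log Format parser: host, timestamp, request line, status, bytes.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) \S+'
)

def parse_log(lines):
    """Yield (host, path) pairs for successful page requests."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("status").startswith("2"):
            yield m.group("host"), m.group("path")

def pages_by_user(lines):
    """Group the pages each visitor requested, keyed by host/IP."""
    history = defaultdict(list)
    for host, path in parse_log(lines):
        history[host].append(path)
    return history

sample = ['203.0.113.7 - - [01/Mar/2010:10:15:32 -0500] "GET /shoes.html HTTP/1.1" 200 5120']
print(pages_by_user(sample))  # {'203.0.113.7': ['/shoes.html']}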
Studying browsing behavior can also help web designers redesign and modify the structure and appearance of the website according to visitors' information. Contents and links in web pages can be arranged in order of content relevance [12]. Improving the e-business's website structure and layout can make it more user friendly, so that customers can quickly locate content or products of interest. Not only will they find what they want more efficiently, this also helps prevent customers from leaving the website and may attract new visitors to the site as well [8]. The aim of web personalization is to help users cope with information overload and to automatically filter relevant new information [13]. Learning the interests of online browsers is most useful when trying to target advertisements and promotions more effectively [8]. Instead of trying to guess what the customer may want, the company can direct its promotions and sales in a more tailored way. This helps the business retain its customers and also helps reduce cost [13], so the budget can be spent more wisely. After all, finding a new customer is five times more costly than retaining an existing one [10]. An adaptive and personalized website will filter new content according to user preferences [13].
E-businesses also utilize web usage mining in another way: the technique can be used to discover interesting patterns and statistical correlations between web pages and user groups. This step often leads to automatic user profiling, and is usually done offline in order to avoid placing a burden on the web server. User profiles can be built by combining user navigation paths with page viewing times. User behavior can then be anticipated in real time by combining current navigation patterns with those extracted in the past [11].
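As an illustration of how such a profile might be assembled (our own sketch, not the method of reference [11]), page viewing time can be estimated as the gap between consecutive requests in a session:

from datetime import datetime

def viewing_times(requests):
    """Estimate time spent on each page as the gap to the next request.

    `requests` is a chronologically ordered list of (timestamp, page)
    pairs for one session; the final page's viewing time is unknown
    and is therefore skipped.
    """
    profile = {}
    for (t1, page), (t2, _) in zip(requests, requests[1:]):
        profile[page] = profile.get(page, 0.0) + (t2 - t1).total_seconds()
    return profile

session = [
    (datetime(2010, 3, 1, 10, 0, 0), "/home"),
    (datetime(2010, 3, 1, 10, 0, 40), "/shoes"),
    (datetime(2010, 3, 1, 10, 3, 40), "/checkout"),
]
print(viewing_times(session))  # {'/home': 40.0, '/shoes': 180.0}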
Since web usage mining is used frequently and is important to the daily activities of an e-business, the business has to ensure it satisfies certain requirements. A web usage mining system for e-commerce must be able to gather useful usage data thoroughly, filter out irrelevant data, establish the actual usage data, discover interesting navigation patterns, display those patterns clearly, analyze and interpret them correctly, and apply the mining results effectively. One of the major challenges in handling this inflow of collected data is figuring out how to group user requests so as to identify the path each user took while navigating the website [1].
Data Collection & Preprocessing of Data
There are four major steps involved in web usage mining: data collection, preprocessing of the data, pattern discovery from the data, and pattern analysis [13]. Data is mainly collected from web logs, which can be broken down into server logs, agent logs and client logs, all located on the web server [8]. Collected data is usually noisy and contains much overlap. To make the data more useful, preprocessing is needed to clean it by eliminating irrelevant entries [8]. In addition to eliminating irrelevant data and noise, preprocessing also includes identifying unique visitors and recovering user sessions [13].
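A minimal sketch of this preprocessing, assuming (timestamp, page) records per visitor; the filtered file extensions and the common 30-minute inactivity heuristic for recovering sessions are illustrative choices, not requirements from the paper:

from datetime import timedelta

IRRELEVANT = (".gif", ".jpg", ".png", ".css", ".js")  # embedded assets, not page views

def clean(requests):
    """Drop requests for embedded resources that do not reflect user intent."""
    return [r for r in requests if not r[1].lower().endswith(IRRELEVANT)]

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split one visitor's (timestamp, page) stream into sessions.

    A gap longer than `timeout` between requests starts a new session,
    a common heuristic for recovering sessions from raw logs.
    """
    sessions, current = [], []
    for req in sorted(requests):
        if current and req[0] - current[-1][0] > timeout:
            sessions.append(current)
            current = []
        current.append(req)
    if current:
        sessions.append(current)
    return sessions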
Pattern Discovery
Once the data has been cleaned, the next step is pattern discovery, which employs three web data mining methods: statistical (or data) analysis, knowledge discovery and prediction models. Statistical data analysis uses mathematical models to explain regularities in the data. Knowledge discovery uses a data searching process to extract information from which business rules can be inferred. Prediction models are based on the assumption that consumer behavior exhibits a certain regularity and repetition. Businesses can predict consumer behavior based on information stored in their databases and classify consumers according to their specific behavior, allowing them to derive pertinent marketing strategies [12]. Additionally, algorithms such as clustering and classification will also be used to unveil useful patterns in the data [8].
Clustering
Clustering is the process of identifying items or groups of items that share a certain characteristic. Clusters can be broken down into two types: usage clusters and page clusters. Usage clusters deal with establishing groups of users that display similar browsing patterns, while page clusters deal with web pages that have related content [6]. Clustering is an important technique in e-commerce. Each cluster may represent a certain browsing pattern within the site [3]. Clustering helps businesses infer user demographics, which helps them perform market segmentation, and it can also be used to provide personalized web content to users [16]. When a user browses a site, a window of their last n pages is maintained. Every time the user visits a new page on the site, this partial session is matched against existing web page clusters. The match can be used to classify the user into what is known as a web usage community: a group of users with similar usage patterns of the website. The community model information can then be used to dynamically provide the user with custom content or targeted advertising [3]. Note that a user might fall into more than one community. Users can also be classified based on how creditworthy they are. Users in different credit categories are given different levels of authorization. By doing so, an e-business can enhance its security [12].
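As an illustration of the matching step, the sketch below compares the window of a user's recent pages against each community's characteristic page set using Jaccard similarity. The community names, page sets and 0.3 threshold are all hypothetical:

def jaccard(a, b):
    """Overlap between two page sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_communities(window, communities, threshold=0.3):
    """Match a sliding window of a user's last n pages against page clusters.

    `communities` maps a community name to the set of pages its members
    typically visit; a user may be matched to more than one community.
    """
    return [name for name, pages in communities.items()
            if jaccard(window, pages) >= threshold]

communities = {
    "bargain_hunters": {"/sale", "/clearance", "/coupons"},
    "sneaker_fans": {"/shoes", "/sneakers", "/brands/nike"},
}
print(match_communities(["/shoes", "/sale", "/sneakers"], communities))
# ['sneaker_fans']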
Clickstream Data
The purpose of clustering users based on their clickstreams for a website is to establish groups of users with similar interests and motivations for visiting that website. If the site is well designed, there will be a strong correlation between the similarity of users' clickstreams and their interests [9]. Clickstream data can be used to explore, model and predict user behavior. Clickstream data includes the date and time of a user click, the URI (Uniform Resource Identifier) of the visited web resource, and some sort of identifier for the user, such as an IP address and browser type or, in the case of authentication, a user name. Special software can also be installed on the client side to collect data on scrolling activity, the active window opened or actual page views [13]. The information may be extended to include user registration information, search queries, geographic and demographic information, the amount of data transferred in bytes and the referrer and agent for the site [1].
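A simple record type capturing the clickstream fields just listed might look like the following; the field names are our own, not from references [13] or [1]:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ClickEvent:
    """One clickstream record, mirroring the fields described above."""
    timestamp: datetime              # date and time of the click
    uri: str                         # URI of the visited web resource
    user_id: str                     # IP address, or user name if authenticated
    browser: Optional[str] = None    # agent (browser) string
    referrer: Optional[str] = None   # page that linked the user here
    bytes_sent: int = 0              # amount of data transferred

event = ClickEvent(datetime(2010, 3, 1, 10, 0), "/shoes.html", "203.0.113.7",
                   browser="Mozilla/4.0", referrer="/home")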
Backtracking Patterns
Backtracking patterns of users can be inferred from clickstream data. Backtracks can be found by examining page sequences during the preprocessing stage described earlier. A backtrack is defined as a page viewed immediately before the user returns to the previously viewed page. So for four pages (ABCB) viewed in that order, C would be a backtrack page. Backtrack pages are deemed to contain little relevance or to be of no interest to the user. If a product is backtracked, it can be moved lower down a list of products so that products of more interest are displayed first. Pages of products with a high backtracking count may be undesirable to the user or may be priced too high. Flagging backtracked pages can help businesses identify products that are selling poorly and perhaps remove them from inventory [3].
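Detecting backtracks in a click sequence is mechanically simple. The sketch below flags any page whose successor equals its predecessor, matching the ABCB example:

def backtracks(pages):
    """Return pages viewed immediately before returning to the prior page.

    For the sequence A, B, C, B the user backtracked from C, so C is flagged.
    """
    return [pages[i] for i in range(1, len(pages) - 1)
            if pages[i + 1] == pages[i - 1]]

print(backtracks(["A", "B", "C", "B"]))  # ['C']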
Probability of Next Page Viewed
Determining the probability of the next page the user will view is also useful. This allows the business to recommend, on the current page, products that the vast majority of users would likely browse to on their own, letting users find information much more quickly than through manual navigation [3]. Another method would be to offer the product on the current page as a bundle with the product most likely to be viewed next. This can not only increase revenue but also give users the satisfaction of reaching their goal in less time [3].
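A common way to estimate such probabilities (offered as an illustration; reference [3] does not specify a model) is a first-order Markov model built from observed page-to-page transition counts:

from collections import Counter, defaultdict

def transition_probs(sessions):
    """Estimate P(next page | current page) from observed click sequences."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for cur, nxt in zip(pages, pages[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

sessions = [["/home", "/shoes", "/checkout"],
            ["/home", "/shoes", "/socks"],
            ["/home", "/sale"]]
probs = transition_probs(sessions)
print(probs["/home"])  # {'/shoes': 0.666..., '/sale': 0.333...}
# The most likely next page after /shoes (ties broken arbitrarily):
print(max(probs["/shoes"], key=probs["/shoes"].get))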
Classification
Businesses perform research on the different types of products that appeal to different clusters of the population. Additional information from data warehouses can be used to build classification rules. Classification models help determine the interests of users and predict trends. For an electronic auction company that provides information about items to auction and previous auction details, predictive modeling can be used to analyze existing information and estimate the values of auctioned items or the number of people participating in future auctions [7].
Classification can also be applied to path analysis. The figure below shows a schematic diagram of the Orient Movie Theatre e-business website.
Content of the website includes film introductions, ticket pre-sale, meal (drink and food) service, and gift (e.g. flowers) service. Films are divided into two categories, denoted by Film subset A1 and Film subset A2 in the graph. There are three films under Film subset A1 and two films under Film subset A2. The ticket ordering page links to one meal service page and one gift service page. Each node in the diagram represents a single page in the website and each traversal from the root node to a leaf node represents a possible navigation path taken by the user. Suppose the company chooses {A11·B, A111·B, B·C, A11·A111} to support classification. Access patterns of customers can then be divided into the following three categories:
Potential customer: {A11·A111}
Valuable customer: {A11·B}, {A11·A111, A111·B}
More valuable customer: {A11·A111, A111·B, B·C}, {A11·B, B·C}
A potential customer is a user who visits the website and browses film information for one film. A valuable customer is a user who reads the information for a film and makes a ticket purchase. A more valuable customer is one who does everything a potential and valuable customer does and adds meal service to their order as well [7].
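These three categories can be encoded directly as pattern-matching rules. The sketch below implements the movie theatre example, with traversal pairs written as plain strings:

def classify_customer(path_pairs):
    """Classify a visitor by the traversal pairs in their session.

    Encodes the illustrative patterns above: browsing film information
    only makes a potential customer, buying a ticket makes a valuable
    customer, and also ordering meal service (B·C) a more valuable one.
    """
    pairs = set(path_pairs)
    bought_ticket = "A11·B" in pairs or "A111·B" in pairs
    if bought_ticket and "B·C" in pairs:
        return "more valuable customer"
    if bought_ticket:
        return "valuable customer"
    if "A11·A111" in pairs:
        return "potential customer"
    return "unclassified"

print(classify_customer(["A11·A111"]))                   # potential customer
print(classify_customer(["A11·A111", "A111·B", "B·C"]))  # more valuable customer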
Pattern Analysis
The discovery of web usage patterns would not be very useful without mechanisms and tools to help the analyst understand them and how they can be applied. Pattern analysis techniques draw upon a large number of fields, including statistics, graphics and visualization, usability analysis and database querying [7]. Visualization techniques have been very successful in helping people understand the behavior of web users. Database querying can help specify the focus of the analysis: since the amount of data to mine is usually large, constraints can be placed to restrict the mining to a portion of the database. Usability analysis involves three steps. First, data is collected. Second, computerized models and simulations are developed to explain the data. Lastly, visualization techniques are used to present the results. The effort in usability analysis is to develop a systematic approach to usability studies [7].
Targeted Advertisement
Advertising is an important technique employed by businesses to increase consumer awareness of their products, and effective advertisements are one of the levers businesses use to maximize profit. Companies are often willing to spend substantial resources on ads because of the potential benefits. Browsing histories can be used to target advertising more effectively by determining which products users are looking at [17].
Dynamic programming is a technique that can be used to determine the next advertisement to show a consumer. It involves state transitions (such as seeing the advertisement, closing it, or deciding to purchase the product) which offer useful clues about what to show the consumer next. For example, if a consumer waits for a while before closing a pop-up advertisement, this could mean he or she is reading the product description carefully, and so the next pop-up should advertise a similar product. The exploration component can be used to show a small number of ads randomly to learn more about the user. When useful information is gained, the model can then offer advertisements that appeal to the user [17]. The key to success in driving revenue from online sales is the ability to predict the needs and expectations of online customers. The market is evolving from seller monopolization to purchaser monopolization. This transformation gives the consumer mind and behavior new characteristics and trends compared with those of the past [14]. Many factors, such as a recession or a market boom, influence how people act within the market [3]. Businesses also know that consumers' shopping habits change over time for several reasons (they move from a cold city to a warmer one, they have children, a new fad emerges, etc.). They should therefore ensure their marketing strategies change too, to keep consumers interested. If businesses want to make sure their marketing models track changes in society and personal differences, the exploration model (mentioned above, and sketched below) should sometimes make random recommendations to the user [14].
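A simple stand-in for that exploration component is an epsilon-greedy rule: mostly show the ad with the best observed click rate, but occasionally show a random one to keep learning. This is our own sketch, not the dynamic programming model of reference [17], and the ad names and rates are made up:

import random

def choose_ad(ads, click_rates, epsilon=0.1):
    """Pick the next ad with an epsilon-greedy exploration strategy.

    With probability `epsilon`, show a random ad to learn more about the
    user; otherwise exploit the ad with the best observed click rate.
    """
    if random.random() < epsilon:
        return random.choice(ads)  # explore: gather new information
    return max(ads, key=lambda a: click_rates.get(a, 0.0))  # exploit

ads = ["running_shoes", "winter_coats", "headphones"]
rates = {"running_shoes": 0.031, "winter_coats": 0.012}
print(choose_ad(ads, rates))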
The market strategy for gaining new clients or attracting potential ones is to classify existing visitors first. When there is a new visitor, his or her characteristics can be compared against the descriptions of existing groups to judge whether this person is a potential client [12]. It should be noted, however, that there may not be sufficient information about a new visitor to compare with existing groups.
WEBMINER
There are quite a few systems that can help with the web mining process; a few are presented here. WEBMINER is a system that automatically discovers association rules and sequential patterns from server access logs. The system uses algorithms to perform user traversal path analysis, such as identifying the most traversed paths. It can scan multiple sites for desired content within one session. It also filters downloaded content and includes options for dealing with duplicate files. Once user transactions or sessions have been identified, graphs can be formed to perform path analysis. Such graphs represent some relation defined on web pages [11], like the one shown above for the film website.
ClickTracks
ClickTracks is website analytics software that allows marketers to track sales in a new way. Instead of reporting only basic statistics such as total revenue, ClickTracks allows the dollar amount of a transaction to be traced back to the originating advertisement or search engine keyword. Visitors can also be categorized and color-coded based on how they found the website. A visual interface is provided to superimpose visitor data on the website itself, so the business can see where visitors go, how long they stay on a certain page, and from which page they exit. The software also provides a keyword ranking feature for each term entered into a search engine that brought visitor traffic to the site. The performance of those keywords can then be compared in terms of the number of visitors, the amount of time they spent on the website, the cost, the revenue, and the return on investment. ClickTracks also provides fraud detection through potential indicators of fraud such as:
No referrer
No keywords
Concentrated clicks
In addition to detecting possible fraud, the software also allows for sorting out false indicators of fraud such as a poorly structured ad [18].
Google Touchgraph
Google Touchgraph is an experimental metasearch product that can map results from Google or Amazon [11]. It is used to discover and illustrate relationships in the form of mind maps. These relationships can be tweaked through various options and filters, and can reveal to the business how its website is connected to the network of other e-commerce sites on the web. The software also allows loading and integrating documents in several different formats. A screenshot of the software is shown below:
The graph shows that although Zappos is best known for selling shoes, it also shows up as a top search for jeans and clothing [19].
Other Web Mining Technologies
Other web mining technologies include Web Analyst and X-Sell Analyst, which are built on top of Poly Analyst. Poly Analyst combines data and text mining techniques in a single tool. Both Web Analyst and X-Sell Analyst aim to support online retail sites in maximizing returns from their customer base [11]. Visible Path mines company messaging sources such as e-mail and instant messaging to understand how networking can enhance sales for a company [11]. WebFoundation collects massive amounts of unstructured and semi-structured text and converts it to XML-tagged information prior to mining for patterns and trends [11].
Challenges and Limitations of Web Mining
As exciting as it may sound to be able to extract useful information from a variety of sources and vast amounts of data, the algorithms behind web mining are quite complex and difficult to develop.
The World Wide Web seems to possess an infinite amount of information that never stops growing. When talking about extracting useful data from all this information, a question arises: will the information and patterns discovered today still be useful tomorrow? In a place as dynamic as the Web, it is hard to say. When mining for information that can be useful to e-businesses, the data collector needs to collect from a variety of sources, including sales transactions, web page views and openings of emails sent as part of a campaign. How much data is enough to draw a conclusion? Measuring the effectiveness of an advertising effort requires knowing not just how many people clicked on it, but also how many of those people actually made a purchase after viewing the ad [6]. After all, the point is to pinpoint the prospective buyers and not just the "browsers."
Another problem in evaluating clickstream data is the volume of data generated. It is not uncommon for a single site to receive tens of millions of page requests per day. Add to that the data collected from specific events used to track consumer behavior and the effectiveness of personalization, and there is quite a hefty amount of data to store. Collecting so much data is infeasible both from a storage perspective and because of the impact it may have on website performance. One possible solution is to sample the data, collecting only a subset of it. The problem with this method is that it can result in the recording of incomplete sessions [6]. Mining techniques need to scale better in order to handle large amounts of data. It is hard to draw a sound conclusion when a session is not evaluated in full, because many external variables may have been missed.
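One way to soften the incomplete-session problem, offered here as a suggestion rather than a method from reference [6], is to sample whole sessions rather than individual page requests, so every recorded session stays complete:

import random

def sample_sessions(sessions, fraction=0.05, seed=42):
    """Sample whole sessions rather than individual page requests.

    Sampling at the request level can split sessions apart and record
    them incompletely; keeping or dropping each session as a unit
    preserves complete navigation paths in the sample.
    """
    rng = random.Random(seed)
    return [s for s in sessions if rng.random() < fraction]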
Suppose web mining successfully produces some useful data for the organization. The resulting information is represented in the form of models. More often than not, these models are rarely understood by the typical businessperson [6]. New ways need to be found to present models to business users, or new types of models need to be defined. Models are important because they can provide business insight, generate scores and predictions that can later be used in personalization, or be deployed directly to form the basis of a real-time recommendation engine [6].
When talking about decision support for improving an e-business, the solution does not lie in web mining alone. External events need to be taken into account and modeled as well: marketing campaigns, promotions, media ads, and site redesigns also affect e-business sales. Providing support for this dynamic environment can be challenging. It is not just about a redesign of the organization's website; it is also about visitors' demographics. People get married and children grow. Product attributes may also change: there may be new choices available, such as a new color or design. The challenge lies in keeping track of these changes and supporting them in analysis [6]. This again reinforces the problem of extracting knowledge that may be useful today but become old news tomorrow.
Detecting behavioral changes in users is essential to triggering model updates. A user planning to buy a TV may browse through the available selection of an online store for several days or weeks and then abandon the topic completely for years after purchasing, or may return several years later to make a purchase [13]. How do we know whether a user who browses a product without purchasing is interested in it or not? The user may very well be interested but simply unable to afford it. This problem also arises when a visitor is looking at a page displaying 20-30 products at once. A user may be interested in a product among the 30 listed but may not even click on it to view additional information, because the user can see from the price alone that the product is out of their budget. When an interested user does not click on a product, it is very hard to infer that they were interested at all. Even if a browsing pattern has been discovered for a certain user, other considerations need to be taken into account. Is the product the buyer has purchased the kind of product people buy frequently, only once, several times a month, several times a year or once in a blue moon? In addition, it is also necessary to know the person's income and spending habits. A higher income could induce someone to spend more and replace items more frequently, but what if the person is frugal? The logic does not hold true in that case. Behavioral patterns of users can recur over time (such as alternating between account checking and investing behavior) or be influenced by the season (such as Christmas and birthdays) [13].
A rundown of the four major challenges faced by anyone wanting to extract information from the web [15]:
The Web is highly dynamic and volatile with constant addition of new data sources and frequent updating and removal of existing sources.
The page structure of websites changes, which can invalidate information extraction systems built on top of those data sources.
Users of the Web obtain information through many sites, not just one site.
Web pages with relevant data are scarce for a given topic of interest; prior research indicates that web-based data sources are not routinely used for business decision-making.
The fourth challenge holds true in the sense that there is a lot of "junk" on the Web. There is useful information too, but the combination of the two creates an oversized information meal that requires a lot of weeding through to find some gold. To reiterate, looking through so much information is very time consuming. The third challenge also holds true: evaluating a user's session at one website is not enough to determine their buying pattern.