A Web application can be analyzed to determine such things as the number of return visitors, the purchases made by first-time buyers, the pages that get the most hits, how much time a user spends on a page, which products sell best, which products are visited most often and draw the user back to the site, and which pages are creating problems, for example because they take too long to load or contain heavy graphics and animation that cause the session to expire and the user to abandon the site. All of this information can be analyzed to support decision making: to decide which products are the key products that bring users back to the application again and again, and which pages drive users away. By focusing on these problems, and on the material that users like best, we can enhance a customer's online experience. If done properly, this analysis should lead to increased traffic and sales. But these priceless chunks of information about customer behavior can be very difficult to find, because a popular site can generate as many as a billion records of raw Web data every day.
To understand what type of information can be gathered, consider the behavior of a user who decides to buy a new pair of shoes online [1]. The user signs onto the Internet and uses a search engine to find sites that sell his favorite brand. When the results come up on his screen, he clicks on the first link. This takes him to an online shoe store, and he begins to browse the site for his brand. He comes across the style and size he needs and adds the pair to his shopping cart. He is ready to check out and enters his personal information, credit card number, and shipping address. The next screen displays the order information and total cost. After seeing what the company charges for shipping and handling, the user decides to cancel the transaction and go back to the search engine to look for the shoes at a different online store. During this entire process, the user's click-stream data has been collected by the Web seller, providing a detailed look at how he got to the site, the Web pages he viewed, the products he considered buying, and the point at which he left the site.
The specific click-stream data that can be collected includes information such as the client IP address, the exact date and time, the HTTP status code, bytes sent and bytes received, time to download, the HTTP method, the target page, the user agent (browser), the query strings the user entered, and the server IP address. If the user found the site through a search engine such as Google, we will be able to determine the target page and the search words entered, and we will be able to access the page of search results on which our site appeared. We will also be able to track e-mail click-streams; this data is generated when an individual receives an HTML e-mail with scripting and clicks on a link or an advertisement.
Click-stream data can be a marketer's dream. It offers a detailed insight into the activities of a visitor to your site. Once analyzed, this information can help us improve the visitor's online experience and, hopefully, increase sales. But to carry out this analysis in a rapid and efficient way, we have to cut through the huge amounts of data collected each day. High-performance application acceleration software will help reduce the volume of data and organize it into a meaningful format and structure. There are also many techniques you can use to make this data more manageable, including data extraction, data transformation, merge and join, pattern-matching field extraction, Web log formatting, data aggregation, and data partitioning. By incorporating the various tools and techniques discussed, we will be able to optimize the performance of the system and improve the management of the Web application. Best of all, we will quickly be able to expose the golden chunks of information about customer behavior that are buried inside.
1.2 Data web housing
By applying the data Webhousing methodology to these golden chunks of data, the click-stream data can be formatted into useful data that can be analyzed to achieve these goals. The data Webhouse houses and publishes click-stream data and other behavioral data from the Web that drive an understanding of customer activities. It acts as a basis for Web-enabled decision making: the data Webhouse must permit its users to make decisions about the Web, as well as make decisions using the Web. It also acts as a standard that publishes data to customers, business partners, and employees appropriately, while at the same time protecting the enterprise's data against unintended use.
1.3 RELATED WORK
Webhousing is an emerging concept in e-commerce; currently not many companies use it, although some companies are now implementing Webhousing technology in their commerce products. Microsoft is planning Webhousing capabilities in the next version of Commerce Server, in which the log files of a commerce site will be sent directly to the data warehouse, where the data can be analyzed to determine date and sales information. Amazon has yet to adopt Webhousing.
1.4 Objectives
The objectives are:
To analyze the number of return visitors and the purchases made by first-time buyers.
To identify the pages that get the most hits and how much time a user spends on a page.
To identify the best-selling (key) products and the products that are visited most often and attract the user to visit again.
To identify the pages that create problems, for example because they take too long to load or contain heavy graphics and animation that cause the session to expire and the user to abandon the site.
All of this information about the Web application can be analyzed to support decision making: to decide which products are the key products that bring users back to the application again and again, and which pages drive users away. By focusing on these problems, and on the material that users like best, we can enhance a customer's online experience. If done properly, this analysis should lead to increased traffic and sales.
CHAPTER 02
CLICKSTREAM
2.1 INTRODUCTION
End-user application developers are increasingly building Web-based applications. For a Web-based application it is necessary to have a database record for each user interaction, that is, for every activity the user performs on a Web page. The Web page has become the user interface environment of choice. In fact, the Web revolution has raised everyone's expectation that all sorts of information will be published through the Web browser interface. The Web focuses on the customer's needs: what the customer wants and what he is looking for. Keeping these aspects in view, it can provide useful information.
Five years ago, marketing analysts were saying, "Soon, there will be a database record for every customer interaction, no matter how seemingly insignificant" [2]. They would then go on to predict how this complete customer record would allow customer behavior, interests, and needs to be understood in greater depth than ever before. From that understanding, everything a business executive wants to know, such as growth, profit, and customer satisfaction, could be derived.
As far as complete customer information is concerned, the Internet has let the genie out of the bottle [2]. The "clickstream" is the record of every mouse click and keystroke of every visitor to a Web site. Click-stream data can be a marketer's dream: it offers a detailed insight into the activities of a visitor to the site. Once analyzed, this information can help to enhance the online experience and, hopefully, increase sales. But to carry out this analysis in a fast and efficient way, we have to cut through the huge amounts of data collected each day. High-performance application acceleration software will help reduce the volume of data and organize it into a meaningful format and structure. There are also several techniques you can use to make this data more manageable, including data extraction, data transformation, merge and join, pattern-matching field extraction, Web log formatting, data aggregation, and data partitioning. By incorporating the various tools and techniques discussed, we will be able to optimize the performance of the system and improve the management of click-stream data. Best of all, you will quickly be able to uncover the golden chunks of information about customer behavior that are buried inside.
2.2 Data Webhousing Methodology
The Internet revolution has propelled the data warehouse onto the main stage, because in many cases the data warehouse must be the engine that controls or analyzes the Web experience. To take on this new responsibility, the data warehouse has to be adjusted; its nature becomes somewhat different. As a result, data warehouses are becoming data Webhouses. The data Webhouse is a Web instantiation of the data warehouse. The Webhouse plays a central and decisive role in the operations of a Web-enabled business. The data Webhouse will:
House and publish click-stream data and other behavioral data from the Web that drive an understanding of customer activities.
Act as a basis for Web-enabled decision making. The data Webhouse must permit its users to make decisions about the Web, as well as make decisions using the Web.
Act as a standard that publishes data to customers, business partners, and employees appropriately, while at the same time protecting the enterprise's data against unintended use.
2.2.1 Features of the Data Webhouse
The data Webhouse must:
Be planned from the start as a fully distributed system, with many separately developed nodes contributing to the overall whole. In other words, there is no center to the data Webhouse.
Not be a client/server system but a Web-enabled system. This implies a top-to-bottom redesign: a Web-enabled system delivers its results and exposes its interfaces through remote browsers on the Web.
Deal uniformly well with textual, numeric, graphic, photographic, audio, and video data streams because the Web already supports this mix of media.
Carry atomic-level behavioral data to at least the terabyte level in many data marts, especially those containing clickstream data. Many behavioral analyses must, by definition, work through the lowest level of data because the analysis constraints preclude summarizing in advance.
Respond to an end-user request in around 10 seconds, regardless of the complexity of the request.
Make the user interface's effectiveness a primary design criterion. The only thing that matters in the data Webhouse is the effective publication of information on the Web. Delays, puzzling dialogs, and the lack of the desired choices are all direct failures.
2.2.2 Sample Data Web house Architecture
These design factors have become more complex as the system supports a broader range of users and requests. To address these problems, the data warehouse architecture needs to be adapted, since it is difficult to make a single database server progressively more powerful.
Fig. 2.1 illustrates a sample data Webhouse architecture.
Fig. 2.1
2.2.3 Hot Response Cache.
One way to take pressure off the main database engines is to build a powerful hot response cache (Fig. 2.1) that anticipates as many of the predictable and repeated information requests as possible. The hot response cache sits alongside the application servers that feed the public Web server and the private firewall entry point for employees. A series of batch jobs running in the main Webhouse application server creates the cache's data. Once stored in the hot response cache, the data objects can be fetched on demand through either a public Web server application or a private firewall application.
The fetched items are complex file objects, not low-level data elements. The hot response cache is therefore a file server, not a database. Its file storage hierarchy will certainly be a simple kind of lookup structure, but it does not need to support a complex query access method.
The hot response cache must be managed so that it supports the application servers' requirements. Ideally, a batch job will have computed and stored in advance the information object that the application server needs. All applications need to be aware that the hot response cache exists and should be able to check it to see whether the answer they want is already there. The hot response cache has two separate modes of use; the nature of the visitor session requesting the data determines which one applies.
The guaranteed response time request must generate some kind of answer in response to a page request that the Web server is handling, usually in less than a second. If the requested object (such as a custom greeting, a custom cross-selling scheme, an immediate report, or an answer to a question) has not been precompiled and hence is not stored, a default response object must be delivered in its place, all within the guaranteed response time.
The accelerated response time request attempts to produce a response to the Web visitor's request, but will default to computing the response directly from the underlying data warehouse if the precompiled object is not found immediately.
The application server should optionally be able to warn the user that there may be a delay in providing the response in this case. The Web server needs to be able to alert the application server if it detects that the user has gone on to another page, so the application server can halt the data warehouse process.
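The following minimal Python sketch illustrates the two request modes just described, assuming a simple in-memory key-to-object store; the names (HotResponseCache, DEFAULT_RESPONSE, compute_from_warehouse) are illustrative and are not taken from any particular product.

DEFAULT_RESPONSE = "<p>Welcome back!</p>"   # precomputed fallback object

class HotResponseCache:
    # A file-server-like lookup: key -> precomputed response object.
    def __init__(self):
        self._store = {}

    def put(self, key, obj):
        self._store[key] = obj

    def get(self, key):
        return self._store.get(key)

def guaranteed_response(cache, key):
    # Must answer within the page-serving budget: use the precomputed
    # object if present, otherwise fall back to a default object.
    return cache.get(key) or DEFAULT_RESPONSE

def accelerated_response(cache, key, compute_from_warehouse, warn_user=None):
    # Prefer the precomputed object, but fall back to a (slow) query
    # against the underlying warehouse if it is missing; optionally
    # warn the user that there may be a delay.
    obj = cache.get(key)
    if obj is not None:
        return obj
    if warn_user:
        warn_user("Your answer is being computed; this may take a moment.")
    return compute_from_warehouse(key)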
2.3 The Flow of Click-Stream Data
Click-stream data is created by a business information infrastructure that supports a Web-based e-business environment. As soon as a user enters the site, a dialogue manager takes over and determines whether this is a repeat or a first-time visit. If the user has already browsed the site, the dialogue is personalized; if not, the user is sent a standard dialogue. Once the dialogue is completed, it is broken down into click-stream records, which are then stored in Web logs. These records are sent to the granularity manager, where they are edited, aggregated, re-sequenced, and summarized. This helps to reduce the volume of data and organize it into a meaningful format and structure.

The records are then entered into the data Webhouse, usually on a historical, per-customer basis, where the data can periodically be refined and entered into the global operational data store (ODS). If needed, the click-stream records can also be entered into the local ODS, the ODS that exists inside the Web site. Such information would include simple transactions, so that if a user comes back to the site later the same day, the local ODS remembers the previous activity. The click-stream records in the local ODS can be used by the dialogue manager to tailor a dialogue for a user. This allows you to provide personalized messages for repeat users, which makes your site more attractive. It is important to remember that data passing from click-stream records into the local ODS remains in the Web environment and never enters the granularity manager. The modification operations, e.g. aggregation, editing, selection, and other processes, have to be done manually.
This click-stream data can be analyzed to determine such things as the number of return visitors, the purchases made by first-time users, the pages that are visited most often, how much time is spent on a page, which are the best-selling (key) products, and much more. You will also determine which advertising or e-mail messages are successful, and which ones are not creating much of a buzz. All of this information can then be used to tailor your messages and greatly improve a customer's online experience by focusing on the materials that users like best. If done properly, this analysis should lead to increased traffic and sales. But these priceless pieces of information about customer behavior can be very difficult to find, because a popular site can generate as many as a billion records of raw Web data every day.
2.4 CLICKSTREAM POST-PROCESSOR
ETL (Extract, Transform and Load) and ELT (Extract, Load and Transform) are two data integration methods. ETL consists of software that transforms and migrates data on most platforms, with or without source and target databases. ELT consists of software that transforms and migrates data within a database engine, often by generating SQL statements and procedures and moving data between tables. ETL tools have a head start over ELT in terms of data quality integration, with Informatica and DataStage integrating closely; the row-by-row processing model of ETL also works well with third-party products such as data quality or business rule engines [3]. Data integration is an essential part of the data Webhouse, so a good ETL design and architecture is a key point for the success of the data Webhouse. The ETL process in the data Webhouse environment is called the clickstream post-processor. Having identified the data that is available from the clickstream, we now take a look at how clickstream data can be fed into the data Webhouse.
OLAP (online analytical processing) allows business users to slice and dice data as required. Normally, data in an organization is dispersed across multiple data sources that are mismatched with each other. It may therefore be a time-consuming process for an executive to obtain OLAP reports such as "What are the most popular products purchased by customers in a specific age range?"
Part of the OLAP implementation process involves extracting data from the different data stores and making them compatible. Making data compatible involves ensuring that the meaning of the data in one data store matches all other data stores.
It is not necessary to create a data warehouse for OLAP analysis. Data stored by operational systems, such as point-of-sale systems, reside in databases called OLTP (online transaction processing) databases. From a structural point of view, OLTP databases are no different from other databases; the main, and only, difference is the way the data are stored.
Examples of OLTP systems include ERP, CRM, SCM, and point-of-sale applications.
OLTP systems are designed for the best transaction speed. When a consumer makes a purchase online, they expect the transaction to occur immediately, which requires a database design (a data model) optimized for transactions.
The architecture shown in Fig. 2.2 illustrates how data is fed into the data Webhouse by applying the Extract > Transform > Load process.
Fig. 2.2: The Webhouse Extract-Transform-Load Architecture
This is a very complex software development project. Click-stream post-processing ultimately prepares the click-stream data for loading into the data warehouse. Several goals must be met by the post-processor application, including the extraction of dimension keys for sessions, users, and hosts:
Filter out unneeded records. Merge associated data and drop records that will not be passed through to the warehouse. Reduce transaction volume as much as possible without compromising the integrity and completeness of the data needed to support the design granularity of the warehouse.
Identify sessions. Tag associated click-stream records with a unique session ID, and verify that event times are logically consistent with one another among the records that describe the session.
Identify users. Match the user with an existing user ID if possible; otherwise, assign a unique anonymous user ID if the identity is unknown.
Identify hosts. Resolve (to the desired granularity) the IP addresses of clients and referrers, and retain country-of-origin and canonical domain data.
The numerical IP addresses of most Internet hosts can be resolved into their text equivalents using the nslookup program.
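As a small illustration of this host-resolution step, the hedged Python sketch below performs the same reverse lookup that nslookup does, using only the standard socket library; the function name resolve_host is our own, and unresolvable addresses are simply kept as raw IPs.

import socket

def resolve_host(ip_address):
    # Reverse DNS: numeric IP -> canonical host name, as nslookup would do.
    try:
        host_name, _aliases, _addresses = socket.gethostbyaddr(ip_address)
        return host_name
    except (socket.herror, socket.gaierror):
        return ip_address   # keep the raw IP if reverse lookup fails

# Example (result depends on the DNS in use):
# resolve_host("8.8.8.8") -> "dns.google"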
2.5 Post Processor Architecture
It is possible to implement the clickstream post-processor as a streaming, transaction-oriented application, as a batch application, or as some combination of these. Each user click generates several log records that eventually reach the clickstream post-processor, because a single page click spawns separate requests for each object embedded on the page, such as GIF or JPEG graphical images. A streaming post-processor architecture requires a staging area in which these records are consolidated and dimensions such as the session ID are computed; in such an implementation the staging area is likely to be an OLTP database. The consolidation and identification tasks are performed by parallel application daemons that massage the raw transaction data, identifying sessions, hosts, and other dimensions.
A batch post-processor architecture, as shown in Fig. 2.3, implies a cascade of sort-merge steps with file-oriented data staging between steps. The consolidation and identification tasks are performed by serial applications that run between the sort and merge processes. Fig. 2.3 illustrates the process flow in a clickstream post-processor. This illustration is meant to show the major processes, independent of whether the post-processor is implemented as a streaming or as a batch application. In either case, select an implementation that supports parallel data paths and multithreaded processes; parallel processing almost always enables scalability.
Fig. 2.3: Inside Click stream Post-Processor
2.5.1 The page event extractor
The page event extractor collects log records from Web servers and from application servers and merges them into page events. A page event may be a routing event (e.g., going to a page) or a static event (e.g., clicking a button on a page). The page event extractor assigns an event dimension key that identifies the type of event represented by the record, such as opening a page or entering data. The extractor must have the ability to reject data that it does not understand or that is inconsistent or incomplete, and to report these records in a form that allows the data extract manager to rectify the problems with the data sources that cause such rejections.
The page event extractor may in fact add records to the click stream. If the applications are not able to log significant events (such as a shopping cart checkout) to the clickstream log flow, the page event extractor will have to obtain these from the appropriate application server and add them as synthetic page events.
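A hedged Python sketch of the filtering and tagging work of a page event extractor is given below; the field names (request_path, method) and the suffix list are illustrative assumptions rather than the format of any particular server, and a real extractor would consult the application server's content index instead of the coarse rules used here.

EMBEDDED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def is_page_event(record):
    # Embedded objects (images, style sheets, scripts) are not page events.
    return not record["request_path"].lower().endswith(EMBEDDED_SUFFIXES)

def classify_event(record):
    # Very coarse event typing: form submissions vs. plain page views.
    return "form_submission" if record["method"] == "POST" else "page_view"

def extract_page_events(records):
    for record in records:
        if not is_page_event(record):
            continue                      # reject embedded-object hits
        record["event_type"] = classify_event(record)
        yield record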
2.5.2 The content resolver
The content resolver inspects each page event record and attempts to relate the page event to site-specific content. In order to do this, the content resolver must have access to various content indexes supplied by the application server. More specifically, the content resolver is responsible for creating two of the event record's dimension keys: its page key and its product key. The page key identifies a specific static page ID on the Web site or, in a dynamic hosting environment, a template ID; this ID may be passed in the original log information, or it may need to be looked up by name in the site's content index.
2.5.3 The session identifier
The session identifier's main role is to collect and tag all the page events that occurred during a single, particular user session. The session identifier is responsible for two of the event record's dimension keys: its session key and its customer key.
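The hedged Python sketch below shows one common way such a session identifier can work: page events sorted by time are keyed on (IP address, user agent), and a new session is opened after 30 minutes of inactivity. A cookie-based customer key, when available, would normally take precedence; that lookup is omitted here, and the field names are assumptions.

from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def assign_sessions(events):
    # events must be sorted by timestamp; yields events tagged with session_id.
    last_seen = {}          # (ip, user_agent) -> (session_id, last timestamp)
    next_id = 1
    for event in events:
        key = (event["ip"], event["user_agent"])
        session_id, last_time = last_seen.get(key, (None, None))
        if session_id is None or event["timestamp"] - last_time > SESSION_TIMEOUT:
            session_id = next_id          # start a new session
            next_id += 1
        last_seen[key] = (session_id, event["timestamp"])
        event["session_id"] = session_id
        yield event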
2.5.4 Dwell time
Dwell time is the time during which a specific page was actively displayed on the user's screen.
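Because the log records only the instant of each hit, dwell time is usually approximated as the gap until the next click in the same session, as in the hedged sketch below; the field names are assumptions, and the last page of a session has no reliable dwell time.

def add_dwell_times(session_events):
    # session_events: one session's page events as a list sorted by timestamp.
    for current, nxt in zip(session_events, session_events[1:]):
        gap = nxt["timestamp"] - current["timestamp"]
        current["dwell_seconds"] = gap.total_seconds()
    if session_events:
        session_events[-1]["dwell_seconds"] = None   # unknown for the final page
    return session_events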
Chapter 03
Data sources
3.1 Click stream
The main data source that feeds the data Webhouse is the HTTP clickstream itself. The Web server maintains a log file record for every request; that is, every activity performed by a user is traced in the log file. A clickstream post-processor receives the raw log data from the Web server, normalizes it into a format that can easily be combined with the required application, and pipes that data into the Webhouse. Each Web server may report different details, but at the lowest, most granular level we should be able to get a record for every page hit containing the following information: the exact date and time of the page hit, the remote client's (requesting user's) IP address, the page requested (with the path to the page on the server machine), and specific control and cookie information.
The most serious problem that affects every analysis of Web-click behavior is that page requests are often stateless: the server does not remember what the user did last time. Without adjoining context, a page hit may just be a random isolated event that is difficult to read as part of a user session. Perhaps the user linked to this page from a remote Web site and then left five seconds later, never to return. It is difficult to make much sense of such an event, so our first objective is the identification and labeling of complete sessions.
The second problem is whether we can make sense of the remote client's IP address. If the only identification is the client IP address, we will not learn much: most users have an Internet service provider that assigns a dynamic IP address to each user, so it is difficult to recognize a user when he visits again. This problem can be solved if the Web server places cookies on the client machine. A cookie usually does not contain much information, but it does identify the requesting computer unambiguously. In order to make the unprocessed clickstream data usable in our data Webhouse, we need to assemble and transform the data so that it has a session point of view; this has become an important step in the back room. Assuming that we have some kind of cookie mechanism, we can transform our data source into the following format (a minimal record sketch follows the list):
Exact date and time of the page hit
Identifying the requesting user from session to session
Session ID
Page and event requested.
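A minimal sketch of such a normalized, session-oriented record is shown below; the field names are illustrative assumptions.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ClickRecord:
    hit_time: datetime      # exact date and time of the page hit
    user_key: str           # stable identifier carried by the cookie
    session_id: int         # assigned by the session identifier
    page: str               # page requested
    event: str              # event requested (view, submit, ...)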
3.2 Why click stream.
The clickstream is one of the best ways of obtaining behavioral data about users, better than the traditional data sources. The clickstream is the time series of every action, and it can be assembled into sessions. From it we can derive information about user interests: what users actually want, what their interests are, and which kinds of visits a user performs most commonly, as well as whether the user is finding things easily or running into problems that confuse him. Through the clickstream, all of this information about the user can easily be obtained, so that an organization can plan in such a way that the user can reach everything of interest in more detail. The Web server logs every activity that the user performs in the system.
3.3 Web server logs.
All Web servers have the ability to log client activity into one or more log files or databases. These logs or databases can be piped into real-time applications through the Web server's common gateway interface (CGI). An example Web server log file is shown in Fig. 3.1.
Fig. 3.1: Web Server log file example
3.4 The Fields of Web server log file
Every Web server has a slightly different log file format, but all of these log files contain the same data fields; only the order of the fields changes. Some Web servers add additional parameters, but these carry no further useful information [3].
3.4.1 IP address: "ppp931.on.bellglobal.com"
This is the address of the client machine that has visited the site. Here the ISP has done a reverse DNS (Domain Name System) lookup to get the client machine name, which shows that the user came to the site via bellglobal.com. If the reverse lookup is not performed, the log will simply contain the numeric IP address, e.g. "123.123.123.25".
3.4.2 Username etc: "- -"
This field shows an anonymous user. When the user accesses a password-protected area, the site will prompt for a user name and password, and the user name will then be shown in this field.
3.4.3 Timestamp: "[26/Apr/2000:00:16:12 -0400]"
The date and time at which the user visited the site, including the time-zone offset.
3.4.4 Access request: "GET /download/windows/asctab31.zip HTTP/1.0"
This is the user's request. In this case it is a GET request (i.e., "show me the page") for the file named above; a GET returns the full document, whereas a HEAD request would fetch only the document's header.
3.4.5 Result status code:
The result status code shows the status of the request: whether the page was accessed successfully or could not be delivered. There are a number of status codes, and each code has its own meaning; a list is available at http://www.bigblock.com/support/wri_http.htm.
The code most commonly generated by the Web server is "200", which indicates a successful page access, while "404" indicates that the page was not found.
3.4.6 Bytes transferred: "1540096"
The number of bytes transferred, i.e., the size of the file requested by the client. When the number of bytes transferred matches the size of the requested file, the download was successful; if it is less than the size of the requested file, the download was partial or failed.
3.4.7 Referrer URL
http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html
This is the URL of the page from which the user followed a link to the requested page, i.e., the page that referred the user here; it shows the user's interests. If the user typed the address of the required page directly into the browser, this field is empty. The referrer also helps trace the user's path through the site, i.e., which pages he visited after entering.
3.4.8 User Agent: "Mozilla/4.7 [en] C-SYMPA (Win95; U)".
The "User Agent" identifier. The User Agent is software that the visitor has used to access this site. The "user agent" string is set by the software developer, and can be name they choose to be. But the most software developer uses the string that helps them to identify the client.
In this case "Mozilla/4.7" means Netscape 4.7, "[en]" implies it's an English version, "Win 95" indicates Windows 95 etc, it tells webmasters about the software which is being used to access their site
Chapter 04
Implementation
Click-stream data is created by a corporate information infrastructure that supports a Web-based e-business environment. As soon as a user enters the Web site, a dialogue manager takes over and determines whether this is a repeat or a first-time visit. If the user has already been to the site, the dialogue is personalized; if not, the user is sent a standard dialogue. Once the dialogue is completed, it is broken down into click-stream records, which are then stored in Web logs. These records are sent to the granularity manager, where they are edited, aggregated, re-sequenced, and summarized. This helps to reduce the volume of data and organize it into a meaningful format and structure [4].
To find the useful data, we first have to reduce the amount of data to analyze by applying the ETL (Extract, Transform and Load) technique.
4.1 Extraction
In the clickstream post-processor, extraction deals with extracting the data for the dimension tables from the log files. First we use MS Access to import the log files; then we can extract the required attributes from that database.
To extract the data from the log files we perform the following steps:
Create a new database.
Import the data from the log file into the database by selecting "External Data" >> "Text File", as shown in Fig. 4.1.
Fig. 4.1: Extracting Data from Log File Step 1
Select the source log file and press OK; this will open the wizard shown in Fig. 4.2.
Fig. 4.2: Extraction Step 2
Select the delimiter to be used in the window above and press Next to load the data, or press Advanced to perform the transformation.
4.2 Transformation
The transformation phase applies a series of rules or functions to the loaded data.
This may include some or all of the following:
Select only certain columns to load
Translate coded values
Derive new calculated values
To transform the data we perform the following steps:
When the Advanced button shown in Fig. 4.2 is clicked, the window shown in Fig. 4.3 appears.
Fig. 4.3: Transformation step 1
Name the fields you have selected from the log table so that each attribute is uniquely identified for the analysis, then press OK. Press Next to apply the transformation; when the process is complete, the window shown in Fig. 4.4 appears.
Fig. 4.4: Transformation step 2
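As an illustration of the three transformation rules listed earlier in this section (column selection, code translation, derived values), the hedged Python sketch below applies them to one parsed log record; the column names and the small status-code table are assumptions chosen to match the import step.

KEEP_COLUMNS = ["ip", "user", "timestamp", "request", "status", "bytes",
                "referrer", "user_agent"]

STATUS_TEXT = {"200": "OK", "404": "Not Found", "408": "Request Timeout"}

def transform(record):
    row = {col: record.get(col) for col in KEEP_COLUMNS}          # select columns
    row["status_text"] = STATUS_TEXT.get(row["status"], "Other")  # translate codes
    try:
        row["kilobytes"] = int(row["bytes"]) / 1024.0              # derived value
    except (TypeError, ValueError):
        row["kilobytes"] = None                                    # "-" means no body
    return row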
4.3 Loading
Loading refers to populating the data warehouse with the data that has been extracted from the log file. There are two types of load:
1. Initial load
2. Incremental load (loading the data warehouse periodically)
As we are loading the data into the data warehouse for the first time, we load the data into the data Webhouse in two steps:
1. First, load the schema into the database
2. Then, load the data into the data Webhouse
First we load the schema into SQL Server; the following steps are performed to load the schema.
Open SQL Server 2005 Management Studio, right-click the Databases folder, and create a database, e.g. clickstream. Now right-click the clickstream database >> Tasks >> Import Data….
The window shown in Fig. 4.5 will appear.
Fig. 4.5: Loading step 1
Click Next and select the data source; since we have an MS Access database, we select "Microsoft Access".
Fig. 4.6: Loading step 2
Press Next, then select the source database, as shown in Fig. 4.7.
Fig. 4.7: Loading step 3
After selecting the source database, click Next and the schema will be loaded into SQL Server.
Now select the destination and the table into which the data is to be loaded, as shown in Fig. 4.8.
Fig. 4.8: Loading step 4
Press the Next button and the window shown in Fig. 4.9 will appear, allowing you to verify your choices.
Fig. 4.9: Loading step: 5
Now press the Finish button to perform the execution. When the process is complete, a success window will appear, as shown in Fig. 4.10.
Fig. 4.10
The data is now successfully loaded into SQL Server and is ready for analysis, as shown in Fig. 4.11.
Fig. 4.11
The attributes of the clickstream database table are:
IP address
Username
Time stamp
Access Request
Result status code
Referrer URL
User Agent
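As a programmatic alternative to the import wizard, the same load could also be scripted, for example with pyodbc as in the hedged sketch below; the driver name, connection string, table name, and bracketed column names are assumptions and would have to match the actual SQL Server database.

import pyodbc

INSERT_SQL = (
    "INSERT INTO clickStream "
    "([IP address], Username, [Time stamp], [Access Request], "
    "[Result status code], [Referrer URL], [User Agent]) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)

def load_rows(rows):
    # rows: a list of 7-tuples in the column order above.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
        "DATABASE=clickstream;Trusted_Connection=yes"
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True        # speeds up bulk inserts
    cursor.executemany(INSERT_SQL, rows)
    conn.commit()
    conn.close()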
Now we can apply SQL queries for the final analysis, since our aim is to analyze the Web application for decision making.
The main objectives of this analysis are:
To analyze which Web pages are most commonly visited.
To analyze which Web pages cause "session killing".
To investigate the click profile of users on the Web site.
4.4 Analysis
select accessrequest, count(accessrequest) as timesVisited from clickStream
where timestamp between '[01/jan/2009:00:00:12-0400]' and '[07/jan/2009:00:00:12-0400]'
group by accessrequest
order by timesVisited desc
This query shows, for one week, which pages are most commonly visited.
select AccessRequest from clickStream
where ResultStatusCode = 408
This query fetches the pages that are causing session killing (HTTP status 408, request timeout).
select [IP address], Username from clickStream
This query fetches the IP address and user name, from which we can determine the click profile of the user, i.e., whether or not it is a registered user.
CHAPTER 05
Conclusion & FUTURE WORK
Once the clickstream data has been successfully transferred into a standard database, it becomes very easy for an organization to analyze the Web application for decision making. The clickstream offers a detailed insight into the activities of a visitor to the Web site, and this information can help to enhance the online experience and, hopefully, increase sales: for example, by analyzing the number of return visitors, the purchases made by first-time buyers, the pages that get the most hits, how much time a user spends on a page, which are the best-selling products, which products are visited most often and attract the user to visit again, and which pages are creating problems, such as taking too long to load.
FUTURE WORK
Here we have used a user profile derivation approach; in particular, we use Web server log files as the data source for user profiling. The general approach, however, is applicable to arbitrary log file formats. This allows for a broader perception of user behavior and has the potential to improve user profiling. In our future work, we will investigate possibilities for integrating automatically derived user profiles with explicitly provided user interests, and we will build a system that performs this entire analysis automatically by tracing users programmatically.