Functions Of Web Structure Mining Algorithms Computer Science Essay

Published: November 9, 2015 Words: 4371

The study of graphs and algorithms is very crucial in computer science. In the last few decades the use of algorithms has solved many difficult problems in a very easy manner. Therefore in many universities and colleges a lot of emphasis is made on the study of graphs and algorithms. A lot of research work is being done on this phase of computer science field.

Analysing an algorithm means the time taken by the algorithm to execute i.e. an approximation of the number of operations that an algorithm performs. The number of operations depends on the execution time; therefore time is used to describe an algorithm's computational complexity. The actual number of seconds that an algorithm takes to execute on a computer is not useful in analysis because the relative efficiency of algorithms that solve a particular problem is our main concern.

Graphs are a main part in composite systems and the use of graph algorithms to excerpt useful information for a problem is the main method of analysis for complex systems. A graph representation is used to provide a particular viewpoint on a problem and a graph hypothetical demonstration is useful for a particular problem depends on whether the interesting features of the problem are likely to be emphasized by the kind of idea that a graph-based representation may provide. The most common types of problem in which operational features play a vital role are problems involving multiple interrelating representatives in complex systems disciplines [2].

So my project is designing and implementing breadth first search on implicit graph for web mining application.

My web application will be educational application for BITS-Pilani, Dubai having 3 areas of interest:

Special Interest Group (SIG)

Conference Alerts (CA)

Higher Education (HE)

Objectives

The present work is intended to meet the following objectives

To survey the functions of Web Structure Mining Algorithms

To design BFS implicit graph and algorithm for web mining application

To identify the Academic Search related functions where Web Structure Mining and Breadth First Search can be applied effectively

Motivation

The net structure of the World Wide Web (www) is constantly changing due to the addition and removal of web pages (i.e. nodes) or the increase or decrease in the no. of incoming and outgoing links (i.e. edges) to or from a page. It is very essential to maintain the web structure in an organized way and implementing using BFS under Graph Structure Analysis.

Problem Description

BFS

BFS

BFS

Methodology

Phases 1: Literature Survey. (referring to journals, catalogues, websites etc)

Phases 2: Data Collection, Learning of Necessary Software/Hardware Tools and Installation & Configuration Steps, Problem Analysis.

Phase 3: Problem Analysis cont. & Design and Generation of Real Data, Coding.

Phase 4: Coding Contd. & Individual Modules Testing & Debugging

Phase 5: Integration Testing & Debugging, Implementation.

Phase 6&7: Final Touches / Fine Tuning, Summary of Findings, Recommendations, Conclusion and Report writing.

Phase 1

Sept6th-Sept15th

In-depth Research on Graph Structure Mining and breadth first search

Phase 2

Sept16th - Oct1st

Acquisition and installation of the intended platform.

Phase 3

Oct 2nd- Oct15th

Learning the coding language of platform

Phase 4

Oct 16th - Nov1st

Submission of the mid-semester report, Design and generation of real data, coding

Phase 5

Nov2nd-Nov15th

Coding.

Phase 6

Nov 16th- Dec 1st

Implementation and Testing.

Phase 7

Dec1st- Dec 15th

Fine tuning of the report, conclusion and submission of the final report.

1.5.1 Plan of Work

Fig 1.5.1 Plan of Work

Summary

In this chapter I discussed about graphs and its use in web mining applications. My project deals with design and implementation of breadth first search on implicit graph and java coding for web mining application i.e. an educational website. I also discussed about the problem description, methodology and plan of work.

CHAPTER 2

LITERATURE SURVEY

2.1 Graphs, BFS and Web Mining

For the running time of an algorithm for a given graph G=(v,e) the size of the input in terms of the no. of edges and vertices is measured. The asymptotic notation i.e. the O notation or Θ notation denotes the running time of an algorithm.e.g. the algorithm runs in time O (V E) means that the algorithm runs in time O (|V||E|).

A graph is a way of representing connections between points or places. Mathematically a graph is a collection of nodes and edges. Nodes are the points that are connected together by the edges of the graph. For example, if we had two small towns connected by a two-way road, we could represent this as a graph with two nodes, each node representing a town, and one edge, the road, connecting the two towns together. In the undirected graph the edge is a two way connection and in the directed graph the edges connect only one way.

Fig 2.1.1 a sample graph with nodes A, B, C, D, E, F [10]

In Fig 2.1.1 A, B….F are the nodes and the path between them are edges. The edges are directed edges (if shown by arrows) or can be weighted edges (denoted by some numbers assigned to them). Therefore a graph can be a directed or undirected and weighted or un-weighted graph.

There two ways to represent a graph are:

Adjacency Lists

Adjacency Matrix

For both directed and undirected graphs the amount of memory required by adjacency list is Θ (V + E).Adjacency lists are used to represent "weighted graphs" by a weight function w: E → R.

The problem with the adjacency list depiction is that there is no faster way to decide if a given edge (u, v) is present in the graph or to search for v in the adjacency list Adj[u]. This disadvantage is taken over by the adjacency matrix representation of the graph but adjacency matrix uses more memory [3].

The adjacency matrix representation of the graph G consists of a |V|x|V| matrix A = (aij) such that

aij= 1 if (i, j) E

aij= 0 otherwise [6]

The adjacency matrix of a graph requires Θ (V2) memory (or the running time) autonomous of the number of edges in the graph.

The adjacency matrix representation can also be used for weighted graphs as the adjacency lists.

2.1.1 Breadth First Search

In a breadth first traversal we visit the starting node and then on the first pass visit all of the nodes directly connected to it. In the second pass we visit nodes that are two edges away from the starting node. With each new pass we visit nodes that are one more edge away. As there can be cycles in the graph it is possible for a node to be on two paths of different lengths from the starting node. When we visit that node for the first time along the shortest path from the starting node that is not considered again. Therefore, either we need to keep a list of the nodes we have visited or we will need to use a variable in the node to mark it as visited node to prevent multiple visits [6].

Using the adjacency matrix representation and assuming that the number of vertices is N, breadth first search can be written like this

Algorithm

Void BFS (int startingVetex) {

IntQueue queue; // Start with an empty queue of integers.

for (int i = 0; i < n; i++) {

visited[i] = false; // Mark all vertices unvisited.

}

visited [starting Vertex ] = true;

queue.enqueue (starting Vertex);

while (! queue.isEmpty ()) {

int v = queue.dequeue (); // Get next vertex from queue.

process (v); // Process the vertex.

for (int u = 0; u < n; u++) {

if (edge Exists[v][u] && visited[u] == false ) {

// we have an edge from v to an unvisited vertex;

// mark it as visited and add it to the queue.

visited[u] = true;

queue.enqueue (u);

}

}

}

}

In this algorithm each vertex that is encountered is marked as visited and placed on the queue. It is processed when it comes off the queue. If it is encountered a second time (by an alternate path from the starting vertex), it has already been marked as visited and so is not placed on the queue again. Since no vertex is placed on the queue more than once the while loop is executed max N times. The run time for this algorithm is O(N2) where N is the number of vertices. In the edge list representation the run time would be O (M) where M denotes the no. of edges.

For this algorithm the first vertex on the queue is the starting vertex. When the starting vertex is removed from the queue all vertices that are connected to the starting vertex by a single edge are located and are placed on the queue i.e. the vertices at distance one from the starting vertex. As these vertices are removed from the queue, the vertices to which they are directly connected are added to the queue. These are the vertices at distance two from the starting vertex and they follow all the distance one vertices on the queue. As the distance two vertices are processed the distance three vertices are encountered and are added to queue following all the distance two vertices and so on. Vertices are added to the queue in order of their distance from the starting vertex i.e. they are added in breadth first order .

2.1.2 Analysis of BFS

Before deliberating the various properties of breadth first search, scrutiny of the running time on an input graph G is done. The operations of enqueue and dequeue spends O (1) time. So the total time taken by queue operations is O (V). The total sum of the lengths of all the adjacency lists is Θ (E), therefore the total time spent in scanning adjacency lists is O (E). The running time for initialization is O (V). Therefore the total running time of BFS is O (V+E). Thus breadth first search runs in time depending upon the size of the adjacency list of graph G.

2.1.3 Web Mining

Web mining is the combination of information collected by data mining methods with information grouped over the World Wide Web. Web mining is used to understand customer behaviour, estimate the efficiency of a particular web site and help compute the success of a marketing operation [7].

Web mining allows us in outlining the data through content mining, structure mining and usage mining. Content mining is used to observe data that is collected by search engines. Structure mining is used to scrutinise data related to the structure of a particular web site. Usage mining is used to survey data related to a user's browser as well as data gathered by forms succumbed by the user during web dealings [7].

The information collected through Web mining is estimated (using software graphing applications) by using data mining parameters such as clustering and classification, association and examination of sequential patterns.

CHAPTER 3

SYSTEM CONFIGURATION

3.1 Introduction

Throughout the project I will be designing and implementing the educational application i.e. BITS-Pilani, Dubai website through PHP software and will be performing Breadth First Search for this site using JAVA coding. I will also be using graphs for BFS.

3.2 Hardware Software Requirements

The project work is planned to be carried out using a system with the following configuration

Hardware

Processor- Intel Core 2 Duo T5450 @ 1.67 GHz

Motherboard - Gigabyte P43-ES3G - Intel P43 Chipset +ICH10

RAM - Lenovo 2.00 GB RAM (DDR2 1066 MHz

Hard Disk - Seagate 160 GB SATA

Optical Drive - Lenovo SATA 32x DVD-writer

Network card - Onboard

Sound Card - Onboard high-definition Audio

Graphics card - NVIDIA GeForce 8400 GT-256 MB

Software

Operating systems - Windows Vista SP 2

Office suite - Microsoft Office 2010

JCreator 4.50 Pro

JDK

Telnet

Graph Traversal Applet

CHAPTER 4

DESIGN

4.1 User Interface

4.1.1 JCreator

JCreator is a Java IDE created by Xinox Software. Its interface is close to that of Microsoft's Visual Studio.

Features of JCreator are:

Custom colour schemes

Infolding the existing projects

Various JDK profiles are supported

Rapid code inscription via project prototypes

Easy project inspecting through class browser

Debugging using an easy interface. No command line prompts necessary

Automatic Class path configuration

UI customization

The run time environment can run the application in a J Unit background or in a command line window as an applet

JCreator IDE does not require a Java Runtime Environment to execute that makes it faster than Java-based IDE's.

Starting a new program

JCreator creates the directory for the user.

Select Project->New Project (fig 4.1.1 a) .In the dialog name is given to the project. For example, if our homework files are contained inside a directory hw1, call the project hw1.

Edit the Location field (contains the full index path of program files), such as c:\homework\hw1.

Fig 4.1.1 a Creating a new project

Click on the Ok button. The project is opened. JCreator adds a Java file with the same name as the project such as Hw1.java. Select Edit->Delete from the menu to delete the file. A new class is added to the project by choosing Project->New class from the menu.

Fill in the name of the class (such as Hello). A new file (such as Hello.java) is added to the project. Type the program into the source window. Note the code accomplishment feature that suggests how to complete partially typed lexes.

Compiling a program

Select Project->Active Project and then select the current project from the submenu.

To compile a program, select Build->Compile Project or hit the F7 key. Compiling errors are displayed in a separate window (fig 4.1.1 b).

Fig 4.1.1 b Compliling the the project

Running a program

After the program compiles successfully run it. Select Build->Execute Project or hit the F5 key. The program executes in a separate console window (Fig 4.1.1 c).

Fig 4.1.1 c Running the project

4.1.2 The Graph Traversal Applet

The Graph Traversal Applet animates Breadth First Search or Depth First Search algorithms and shows how graphs are traversed in each algorithm.

The applet provides a simple user interface helping users to control all the tasks needed to run the animation. For designing the user interface priority is given to allow users to control the animation and the display at all times instead of the applet controlling the user actions and the rate of animation [5].

The user interface is shown below:

Figure 4.1.2: the user interface of the graph traversal applet

4.2 Architecture and Data

4.2.1 Java File Handling

The Java's I/O system is stored java.io. Put java.io when needed to read or write files. The other package including file handling classes is java.util.zip. The classes in java.util.zip can produce compressed or decompressed files. This class is based on the functions provided the I/O classes defined in java.io. Therefore it is unified in to Java's overall I/O strategy.

Sample

public void writeFile() {

String filename = "c:\\test.txt";

try {

BufferedWriter writer = new BufferedWriter(new FileWriter(filename,true));

writer.write("test the input.");

writer.newLine();

writer.write("line 4");

writer.close();

} catch (IOException g) {

g.printStackTrace();

}

}

4.2.2 Breadth-first heuristic search

Frontier search is a memory efficient approach to graph search that only stores the Open list and saves memory by not storing the 'closed list'. It uses techniques to prevent regeneration of previously closed nodes that are no longer in memory as the traditional trace back method of solution recovery requires all closed nodes to be in memory. It also uses a divide and conquer method of solution recovery in which each search node keeps track of an intermediate node along the best path that leads to it. When the goal node is selected for expanding the intermediate node associated with it must be on an optimal solution path. This intermediate node is used to divide the original search problem into two sub problems - Finding an optimal path from the start node to the intermediate node and to find an optimal path from the intermediate node to the goal node. Then these sub problems are solved recursively by the same search algorithm until all the nodes along an optimal solution path for the original problem are identified [1].

My project data will be the web application i.e. educational application for BITS-Pilani, Dubai having 3 areas of interest:

Special Interest Group (SIG)

Conference Alerts

Higher Education

With the users:

Students

Library staff

Faculty

Therefore the root of the search tree will be BITS-Pilani, Dubai, parent nodes-SIG, CA, HE and children- students, library staff and faculty. For this tree BFS will be done.

CHAPTER 5

IMPLEMENTATION

5.1 Introduction

In this chapter we will discuss about the algorithm regarding the generation of the graph by inputting the structure of the website (say www.bitsdubai.com) and then implementing breadth first search for this graph generated. The coding of the program is done with java and the platform is JC Creator.

The program is an open program i.e. it can be implemented for any web application and by any user.

5.2 Algorithm

Input

The inputs for the program are the links associated with site www.bitsdubai.com these links act like nodes and edges for the graph.

For the class ProjGraph :

To add and connect vertices

Parsing a text file that holds the data of our graph

Perhaps to override the toString() method to check if the graph is properly built

For the class BFS :

Track the edges of a specific vertex

Checking if some vertex is connected to some other vertex or not

Generating Graph

Creation of the adjacency list and parsing the data file

As the graph is a undirected graph, so we can connect vertex X to vertex Y and vice versa by using a private-helper method.

If vertex X already exists in the adjacency list, get its edge list and add vertex Y to it.

If it doesn't exist create a new Array List and add vertex Y to it and put it all in the adjacency List.

The function "private void connect()" returns true iff vertex X points to vertex Y.

The function "return edges.contain()" returns all the edges of a certain vertex or throws an exception if the vertex does not exist in this graph ("throw new RuntimeException()").

The function "private void parseDataFile(String filename)" reads a text file with the graph-data.

Then the string representation of the graph is created.

The main function of the program takes the input from the data file using file handling and generates the graph.

Performing BFS

A stack is created to store all the vertices of our path.

First the 'end' vertex is pushed on the stack.

Then a loop is put from the highest level back to the first level and a loop through each level.

Each vertex in that level is checked whether it is connected to the vertex on the top of the stack. If a match is found that match is pushed on the stack and broken out of the loop.

Now the stack is reversed before returning the path in the correct order (from start to finish).

During the BFS traversal a queue 'cont' is being created consisting of newly discovered vertices where the first vertex (the starting point) is on index 0 i.e. (level 0), it's edges at index 1 (level 1) and so on.

Get the level from the queue which was added last and then a new level is created to possible unexplored vertices in.

Loop is done through each of the vertices from the last added level and connects the neighbouring vertices to each vertex.

Then looping through each connecting vertex and if the connecting vertex has not yet been visited, only then add it to the newly created level-list and mark it as visited in our set.

We do not need to search any further if we stumble upon the 'end' vertex .In that case just 'return' from this method.

Only make the recursive call if we have found vertices which have not been explored yet.

An empty 'container' is created to store all the levels from the 'start' vertex in. A level is also a list with one or more vertices.

The first level only holds the 'start' vertex, which is added first, this is the 'initial' list.

The 'start' vertex is also stored in asset which keeps track of all the vertices we have encountered so that we do not traverse vertices twice (or more).

Once the above 3 steps are initialized, we can call the actual BFS-Algorithm

Once the BFS-Algorithm is done, we call the backTrack() method to find the shortest path between 'end' and 'start' between the explored levels of the graph.

5.3 Summary

Therefore the main method in the program involves getting the structure of the website for which the BFS has to be performed and then parsing the text file (java file handling) that holds the data of our graph.

Therefore the main method involves:

Creation of the graph

File handling

Getting the shortest path between 2 vertices (BFS) and print the shortest path

CHAPTER 6

DISCUSSION

6.1 Performance Analysis of BFS

Depth

Nodes

Time

Memory

0

2

4

6

8

10

12

14

1

111

11,11

106

108

1010

1012

1014

1 millisecond

seconds

11 seconds

18 minutes

31 hours

128 days

35 years

3500 years

100 B

11 B

1 MB

111 MB

11 GB

1 TB

111 TB

11,111 TB

BFS depth requirements on time and memory

6.2 Applications of BFS

Breadth first search is very helpful in graph theory, for example:

Finding all nodes within one connected module

Finding the shortest path between two nodes

Analysing a graph for bipartiteness

Computing the maximum flow in a flow network [11]

6.3 Advantages of BFS

BFS can be implemented to any search problem. BFS never get entombed exploring the useless path. For more than one solution, BFS can find the minimal one that requires less number of steps.

BFS can be used to experiment bipartiteness by starting the search at any vertex and giving alternating labels to the vertices visited.

6.4 Disadvantages of BFS

The disadvantage of BFS is being a "blind search" i.e. when the search space is large the search recital will be poor associated to other exploratory searches .BFS performs well for the small search space. It performs best when the goal stated is in upper left hand side of the tree. It performs judiciously poorly relative to other search algorithms if the goal state lies in the bottom of the tree. The main drawback of BFS is its memory obligation. If any solution is far away from the root, BFS will consume lot of time.

CHAPTER 7

FUTURE EXTENSIONS

My Project can be implemented in future for Image Processing Problems.

The basis of my project i.e. breadth first search can be used for the dispensation of digital images by using competent algorithms for corroding, distending, skeletonizing, and distance transforming regions. These algorithms traverse in breadth first way, using a queue for storage of unrefined pixels and manage memory efficiently. As soon as the processing has been completed the pixels are removed from the queue. The image is still epitomized as a pixel milieu in memory but the graph is just a appropriate structure for thinking about the algorithms. All the algorithms will be based on viewing a raster image not as a pixel matrix but as a graph whose vertices represent pixels and whose edges represent neighbourhood between pixels. This suggests that pixels in the image can be traversed in a breadth-first manner, rather than with the traditional raster scan. Because pixels to be processed are stored in a queue, breadth-first traversal yields propagation patterns that grow or shrink uniformly across their boundary like wave fronts.

The algorithm will be using following:

Flood Filling (or Object labelling)

Storing Boundary Pixels

Repeated Erosions

Distance Transformation

Skeletonization

My project can also be implemented in future for creating an educational website where the user can select keywords that are present in world-wide web. Then using breadth first search to extract the nodes and their values in a web page and map them to a proper format so that it is easy for the users to search for a particular data whenever they want.

CHAPTER 8

LESSONS LEARNT

In my project regarding BFS I have learnt that BFS is a comprehensive search algorithm. It is easy in implementation and can be used for any search problem.

BFS has no probable infinite loop problem which crashes the computer. Breadth first search never gets entombed while reconnoitring the useless path. If a solution exists, BFS traces it out.

Moreover I have learnt lot about web mining, graphs, java file handling, parsing etc. and their use in computer science.

CHAPTER 9

CONCLUSION

The above report concludes the 4 months of my project work on Design and Implementation of BFS on Implicit Graph for Web Mining Application. My work mainly revolved around studying and analyzing as much literature as possible on topics like running time of algorithms, graphs, breadth first search, graph traversal applet, web mining, JCreator, PHP etc. In this report I discussed about the literature, algorithm, design (data, program and interface) java coding, implementation of the algorithm of BFS, execution time of the algorithm, creation and traversal of the graph for BFS.

Therefore in this report where I got to learn, understand and comprehend the various aspects of my project topic practically. The last 4 months' worth of knowledge and data helped me in analyzing my project. I would conclude by saying that thorough knowledge about the aforementioned topics will provide for a concrete foundation for my aspirations of managing engineering in the near future.

REFERENCES

Barnat, J.; Brim, L.; and Chaloupka, J. 2003. Parallel breadth-first search LTL model checking. In Proceedings of the 18th IEEE International Conference on Automated Software Engineering (ASE'03), 106-115

Bang-Jensen, J. & Gutin, G.: Digraphs: Theory, Algorithms and Applications. Springer-Verlag (2002)

Mehlhorn, K.: Graph Algorithms and NP-Completeness. Springer-Verlag (1984)

Reinhard Diestel Graph Theory Electronic Edition 2000 Springer-Verlag New York 1997, 2000

Introduction to The Design and Analysis of Algorithms, second edition, Pearson International Edition, Anany Levitin

Jeyalatha Sivaramkrishnan, Vijayakumar Balakrishnan, "Web Mining Functions in an Academic Search Applications", Informatica Economică vol. 13, no. 3/2009.

Java Breadth First Search problem, bytes.com

http://bytes.com/topic/java/answers/626759-java-breadth-first-search-problem