Graph Matching Algorithm For Web Data Mining Computer Science Essay

Published: November 9, 2015 Words: 4436

The complexity of the graph matching approaches make it very difficult to take into account all types of dependencies between the vertices and edges in the model and data graphs. That is why in most existing frameworks defined for this problem only unary and binary relations between vertices are taken into account. The advantage of such simplifications is that the complexity of graph matching is reduced when tackling the original problem , and this allows the use of techniques that require evaluating a lot of individuals through the search for the best solution. On the other hand, these approaches are ignoring many relationships that can be decisive when searching for a satisfactory matching. Due to this, and depending on the type of problem, these simplifications could lead to non satisfactory results. Hence this project deals with a new concept of new approach for large amounts of data. Since huge amounts of data and information of all kinds is being stored over the WWW, this technique is applied to the WWW.

The World Wide Web which is shortly known as WWW and commonly called as The Web is a system of interlinked hypertext documents contained on the Internet. A web page may contain text, images, videos, and other multimedia and navigate between them via hyperlinks. The whole system is like a hub of endless data that can be surfed at all times. A proper system can be developed to approach them in an organized manner. Most web pages will themselves contain hyperlinks to other related pages and perhaps to downloads, source documents, definitions and other web resources. Such a collection of useful, related resources, interconnected via hypertext links, is what was dubbed a "web" of information. When such large amounts of data are stored in such huge quantities the concepts of Graph Matching become hard to ignore. The complexity of the graph matching approaches make it very difficult to take into account all types of dependencies between the vertices and edges in the model and data graphs. That is why in most existing frameworks defined for this problem only unary and binary relations between vertices are taken into account. The advantage of such simplifications is that the complexity of graph matching is reduced when tackling the original problem , and this allows the use of techniques that require evaluating a lot of individuals through the search for the best solution. On the other hand, these approaches are ignoring many relationships that can be decisive when searching for a satisfactory matching.

1.1 Objectives

To date there is no graph based methods are not that commonly used as graph matching techniques depend on the combinational nature of the graphs ,hence it is very difficult to figure out how to decode certain graphical interpretations.

1.2 Motivation

This project will help us in learning about Data Mining and when there is a large amount of data dumped databases or sites. Data mining is a multidisciplinary field, drawing work from areas including database technology, machine learning, statistics, pattern recognition, information retrieval, neural networks, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. It helps to present techniques for the discovery of patterns hidden in large data sets, focusing on issues relating to their feasibility, usefulness, effectiveness, and scalability. It can be seen that graphs are a versatile and flexible representation formalism suitable for a wide range of problems in intelligent information processing, including the areas of pattern recognition and computer vision. A wide spectrum of graph matching algorithms have become available meanwhile. They range from deterministic approaches, suitable for finding optimal solutions to problems involving graphs with a limited number of nodes and edges, to approximate methods that are applicable to large-scale problems.

1.3 Methodology

Phase 1

Broad Research graph matching techniques for data mining

Phase 2

Understanding obstacles in data mining

Phase 3

Formation of literature after survey, and rough construction of algorithm

Phase 4

Algorithm finalized using pseudo code and testing jdk ubuntu.

Literature survey completed.

Data mining completed

Phase 5

Coding the psedo code from algorithm and construct dogma index

Phase 6

Implementation and design element completed.

Phase 7

Fine tuning of the report, conclusion and submission of the report.

1.4 Summary

This project was chosen to do some major task in graph matching techniques. By figuring out how unmatched nodes are present, the relations between various databases and sites (web based) can be found.

2. Related Work

2.1 Web Mining [3] [6]

Web Mining - an application of Data Mining figures out patterns existing on the World Wide Web (WWW).

What is Data Mining ?

Data Mining is the process of extracting patterns from data . It is an important tool used in the transformation of data into interpretation.

Data Mining is used to uncover patterns in data but is also used on samples of data. The entire porcess will be ineffective if the samples are not a good representation of the larger body of data. If the patterns to be "mined" are not present in the sample data , then data mining will not discover the patterns in a larger body of data.

Data mining usually deals with four classes of tasks namely:

1. Clustering: arranges data into groups, but the groups are not predefined.

2. Classification: arranges data into predefined groups.

3. Regression: finds a function that models the data with least error.

4. Association Learning Rule: searches for relationship between variables.

As it is essential and relevant to the project, lets take a look at Java Data mining, Java Data Mining (JDM) is a standard Java API for developing data mining applications and tools. JDM defines an object model and Java API for data mining objects and processes. JDM enables applications to integrate data mining technology for developing predictive analytics applications and tools.

Let us now take look at Java Data Mining. Java Data Mining is a standard Java API for developing data mining applications and tools. It defines object model and Java API for data mining objects and processes. It also enables applications to integrate data mining technology for developing predictive analytics applications and tools.

There are a few important terms to know under Web Mining.

Web usage mining is the process of finding out what users are looking for on the Internet. Users are interested in either textual or multimedia data. Web mining helps find patterns for a particular group of people, or for Internet users in a particular region.

Web Structure mining: Web structure mining is the process of using graph theory to analyze the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds: The first being extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects the web page to a different location. And the second being mining the document structure: analysis of the tree-like structure of page structures to describe HTML or XML tag usage.

Why Web mining?

Better Customer relations can be established using these concepts of mining data.

The only drawback to these mining concepts is the loss of data privacy ,as well as grouping of data based on controversial attributes such as race, religion etc. can cause concern amongst organizations

3. Graph matching [1]

Before stepping any further, as this project is based on index based GRAPH MATCHING, let us understand some basics notions and terminology

A graph G = (V,E) in its basic form is composed of vertices and edges. V is the set of vertices(also called nodes or points) and E contained in V Ã- V is the set of edges (also known as arcs or lines) of graph G. The difference between a graph G and its set of vertices V is not always made strictly, and commonly a vertex u is said to be in G when it should be said to be in V .

The order (or size) of a graph G is defined as the number of vertices of G and it is

represented as |V | and the number of edges as |E|.

If two vertices in G, say u, v in V , are connected by an edge e in E, this is denoted by

e = (u, v) and the two vertices are said to be adjacent or neighbours.

3.1 Directed and Undirected graphs

FIG 1.1 Directed graph

Edges are said to be undirected when they have no direction, and a graph G containing only such types of graphs is called undirected. When all edges have directions and therefore (u, v) and (v, u) can be distinguished, the graph is said to be directed. Usually, the term arc is used when the graph is directed, and the term edge is used when it is undirected. In this dissertation we will mainly use directed graphs, but graph matching can also be applied to undirected ones2. In addition, a directed graph G = (V,E) is called complete when there is always an edge (u, u0) in E = V Ã- V between any two vertices u, u0 in the graph.

FIG 1.2: Undirected graph

Graph vertices and edges can also contain information. When this information is a simple label it is called a labelled graph. Sometimes it is observed that edges and vertices contain more information, hence these graphs are known as attributed graphs.

3.2 Classifications of graph matching methods

Many fields such as computer vision, scene analysis, chemistry and molecular biology have applications in which images have to be processed and some regions have to be searched for and identified. When this processing is to be performed by a computer automatically without the assistance of a human expert, a useful way of representing the knowledge is by using graphs. Graphs have been proved as an effective way of representing objects .When using graphs to represent objects or images, vertices usually represent regions (or features) of the object or images, and edges between them represent the relations between regions.In this work, we will consider model-based pattern recognition problems, where the model is represented as a graph (the model graph, GM), and another graph (the data

graph, GD) represents the image where recognition has to be performed. The latter graph is built from a segmentation of the image into regions.

3.3 Exact and inexact graph matching

In model-based pattern recognition problems, given two graphs -the model graph GM and the data graph GD- the procedure of comparing them involves to check whether they are similar or not. Generally speaking, we can state the graph matching problem as follows:

Exact graph matching:

Given two graphs GM = (VM,EM) and GD = (VD,ED), with |VM| = |VD|, the problem is to

find a one-to-one mapping f : VD != VM such that (u, v) in ED iff (f(u), f(v)) in EM. When

such a mapping f exists, this is called an isomorphism, and GD is said to be isomorphic to GM. This type of problems is said to be exact graph matching.

Inexact graph matching:

The term inexact applied to some graph matching problems means that it is not possible to find an isomorphism between the two graphs to be matched. This is the case when the number of vertices is different in both the model and data graphs4. This may be due to the schematic aspect of the model and the difficulty to segment accurately the image into meaningful entities. Therefore, in these cases no isomorphism can be expected between both graphs, and the graph matching problem does not consist in searching for the exact way of matching vertices of a graph with vertices of the other, but in finding the best matching between them. This leads to a class of problems known as inexact graph matching. In that case, the matching aims at finding a non-bijective correspondence between a data graph and

a model graph. In the following we will assume |VM| < |VD|.

The interest of inexact graph matching has been recently increased in the last years due to the application of computer vision to areas such as cartography, character recognition, and medicine. In these areas, automatic segmentation of images results in an over-segmentation and therefore in the data graph containing more vertices than the model graph. That is why applications on these areas do usually require inexact graph matching techniques

In an inexact graph matching problem since we have |VM| < |VD|, the goal is to find a mapping f0 : VD ! VM such that (u, v) 2 ED iff (f(u), f(v)) 2 EM. This corresponds to the search for a small graph within a big one. An important sub-type of these problems are sub-graph matching problems, in which we have two graphs G = (V,E) and G0 = (V 0,E0), where V 0 in V and E0 in E, and in this case the aim if to find a mapping f0 : V 0 ! V such that (u, v) 2 E0 iff (f(u), f(v)) 2 E. When such a mapping exists, this is called a subgraph matching or subgraph isomorphism.

3.4 Subgraph Isomorphism:

FIG 1.3: Subgraph Isomorphism Example

To elaborate on sub graph isomorphism: In theoretical computer science, the subgraph isomorphism problem is a computational task in which two graphs G and H are given as input, and one must determine whether G contains a subgraph that is isomorphic to H. Subgraph isomorphism is a generalization of both the maximum clique problem and the problem of testing whether a graph contains a Hamiltonian cycle, and is therefore NP-complete. However certain other cases of subgraph isomorphism may be solved in polynomial time.Sometimes the name subgraph matching is also used for the same problem. This name puts emphasis on finding such a subgraph as opposed to the bare decision problem

3.5 NP Completeness

1.Any given solution to the problem can be verified quickly (in polynomial time); the set of problems with this property is called NP (nondeterministic polynomial time).

2.If the problem can be solved quickly (in polynomial time), then so can every problem in NP.

Although any given solution to such a problem can be verified quickly, there is no known efficient way to locate a solution in the first place; indeed, the most notable characteristic of NP-complete problems is that no fast solution to them is known. That is, the time required to solve the problem using any currently known algorithm increases very quickly as the size of the problem grows. As a result, the time required to solve even moderately large versions of many of these problems easily reaches into the billions or trillions of years, using any amount of computing power available today. As a consequence, determining whether or not it is possible to solve these problems quickly is one of the principal unsolved problems in computer science today.

FIG 1.4 : Graph Matching

3.6 Complexity of graph matching

Graph matching is considered to be one of the most complex problems in object recognitionin computer .Its complexity is due to its combinatorial nature

Exact graph matching: graph isomorphism

This whole category of graph matching problems has not yet been classified within a particular type of complexity such as P or NP-complete. Some papers in the literature tried to prove its NP-completeness when the two graphs to be matched are of particular types or satisfy some particular constraints, but it still remains to be proved that the complexity of the whole type remains within the NP-completeness atmost. On the other hand, for some types of graphs the complexity of the graph isomorphism problem has been proved to be of polynomial type. An example is the graph isomorphism of planar graphs, which has been proven in to be of polynomial complexity, although the cost of the leading constant also appears to be quite large.

As a result, it can be said that this issue remains as an interesting open theoretical

problem, although it also encourages researchers to try to find polynomial time solutions forthis type of graph matching problems.

Exact sub-graph matching: sub-graph isomorphism

This particular type of graph matching problems has been proven to be NP-complete. However, some specific types of graphs can also have a lower complexity.

For instance, the particular case in which the big graph is a forest and the small one to be matched is a tree has been shown to be of a polynomial complexity

Inexact graph matching: graph and sub-graph homomorphisms

In inexact graph matching, where we have |VM| _ |VD|, the complexity is proved in to be NP-complete. Similarly, the complexity of the inexact sub-graph problem

is equivalent in complexity to the largest common sub-graph problem, which is known to be also NP-complete.

3.6 Obstacles faced in Graph Matching

The complexity of the graph matching approaches make it very difficult to take into

account all types of dependencies between the vertices and edges in the model and datagraphs. That is why in most existing frameworks defined for this problem only unary and binary relations between vertices are taken into account. The advantage of such simplifications is that the complexity of graph matching is reduced when tackling the original problem, and this allows the use of techniques that require evaluating a lot of individuals through the search for the best solution. On the other hand, these approaches are ignoring many relationships that can be decisive when searching for a satisfactory matching. Due to this, and depending on the type of problem, these simplifications could lead to non satisfactory results. Regarding the application of genetic algorithms in graph matching problems, the performance that they present is remarkable, specially when they are combined with dedicated frameworks such as Bayesian ones. In addition, the number of evaluations that they require to reach a solution is very low compared to other stochastic heuristic searches. However,one of their drawbacks is that their performance is very dependent on the high number of input parameters that have to be set. This dependence is specially dependent on the type of cross-over and mutation operators selected. which are more appropriated for particular types of problems, and the user who wants to apply these methods needs to have a lot of experience in order to tune properly the algorithm and obtain satisfactory results.

4. System Configuration [7]

4.1 Introduction

System configuration is the process of setting up the hardware devices and assigning resources to them so that they work together without any complications. A properly-configured system will allow one to avoid nasty resource conflict problems, and make it easier for one to upgrade the system with new equipment in the future. An improperly-configured system will lead to strange errors and problems, and make upgrading a nightmare.

4.2 Hardware and Software Resources

Hardware Requirements:

Windows 95/98/2000/NT/ME/XP/Vista

1.99 GB MB RAM, Up to 1 GB Hard Disk Space, Internet Connection(LAN/DSL/Cable)

These are the basic requirements for any operating system to run the following softwares and programs used for this project. They even run in Linux or Ubuntu based OS but will need a different configuration as they use different file sharing programs.

Software Requirements:

NetBeansIDE Version 6.8 [7]

Java: 1.6.0_18; Java HotSpot(TM) Client VM 16.0-b13

System: Windows XP version 5.1 running on x86

This software runs on Windows, Linux, Mac OS X and Solaris. NetBeans IDE is open-source and free.

JCreator: JCreator is a delevelopment tool for Java coding and it is more efficient and more reliable than other Java tools. It can be of great use for a Java-Specialist as well as a learning programmer. The free version lacks some features like code completion bit also lacks the level of extensibility through third party plugins

4.3 Summary

Thus, easy working with the above mentioned configurations and these softwares can be achieved. There were some troubleshooting issues which were cleared by the help of forums .

5. Graph matching techniques

We can find many techniques applying probability theory to graph matching problems. A review on general purpose probabilistic graph matching can be found in where different types of probabilistic graphs, different techniques for their manipulation, and fitness functions appropriated to use for these problems are presented.

The first works applying probability theory to graph matching use an iterative approach using a method called probabilistic relaxation, and only take into account binary relations and assuming a Gaussian error. Finally, in a new fitness function is developed for graph matching based on a mixture model that gauges relational

consistency using a series of exponential functions of the Hamming distances between graph neighbourhoods. This fitness function is simply a weighted sum of graph Hamming distances. Unfortunately, all these works do not consider dependencies between more than two vertices.

Applying the EM algorithm to graph matching

Another important approach is the EM algorithm are examples of this, in which two similar EM frameworks are proposed for two different graph matching problems. An algorithm for inexact graph matching that concentrates only on the connectivity structure of the graph and does not draw on vertex or edge attributes

Applying other probability theory methods to graph matching

Other examples of probability theory-based techniques are found in references commented, on the following subjects: discrimination of facial regions using simulated

Applying decision trees to graph matching

Decision trees have also been applied to graph matching. An example of this is, in which decision trees are used for solving the largest common sub graph problem

instead of applying queries to a database of models.

6. Design Component

6.1 Scenario Considered

For a given undirected graph G, a matching in a subset of S of edges G in which no two edges in S are adjacent in G. The Maximum Cardinality Matching (MCM) problem is to find the maximum matching cardinality. A node that is not matched is called exposed node. A basic method to find MCM starts with an arbitary matching in Q. An augmenting path P w.r.t Q is found. Hence a new matching is found by taking those edges of P or Q that are not in both P and Q.

6.2 Algorithm and procedure parameter

void cardinalmatch(int m,int n,int nodei[],int nodej[],int pair[])

n: int;

entry: The no. of nodes in the graph from 1 to n

m: int;

entry: The no. of edges in the graph

nodei,nodej: nodei[i],nodej[i] are the i-th edges of the graph i=1,2,..m

exit: In the solution, node I is matched with pair[i], i=1,2,…n. If pair[i]=0 then node i is unmatched.

Algorithm:

Input: A set of 5 inputs are given in accordance to the source code and algorithm where

1st lines states the number of cases taken into consideration

n: states the number of test cases

m: states the number of edges

ith node: starting edge of ith node in the appropriate case number

ith node: ending edge of ith node in appropriate case number

Output: For appropriate case number

Maximum matching

Number of unmatched nodes

Step 1: Input the number of test cases.

Step 2: for(i=0;i<x;i++), where x is the number of test cases

Step 3:

6.3 Source Code

/**

* @(#)finalcode.java

*

* finalcode application

*

* @author

* @version 1.00 2010/4/30

*/

public class finalcode {

public static void cardinalmatch(int m,int n,int nodei[],int nodej[],int pair[])

{

int i, j, k, n1, istart, first, last, nodep, nodeq, nodeu, nodev, nodew;

int height1, height2, unmatch;

int fwdedge[]= new int[m + n + 1];

int firstedge[] = new int[n + 2];

int grandparent[] = new int[n + 1];

int queue[] = new int[n + 1];

boolean outree[] = new boolean[n + 1];

boolean newnode, nopath;

//set up foward star graph representation

n1 = n+ 1;

k = 0;

for (i=1; i<=n; i++)

{

firstedge[i] = k + 1;

for (j=0; j<m; j++)

if (nodei[j] == i)

{

k++;

fwdedge[k] = nodej[j];

}

else

{

if (nodej[j] == i)

{

k++;

fwdedge[k] = nodei[j];

}

}

}

firstedge[n1] = m + 1;

//all nodes are unmatched

unmatch = n;

for (i=1; i<=n; i++)

pair[i] = 0;

for (i=1; i<=n; i++)

if (pair[i] ==0)

{

j = firstedge[i];

k = firstedge[i +1] - 1;

while ((pair[fwdedge[j]] == 0) && (j <k))

j++;

if (pair[fwdedge[j]] == 0)

{

//match a pair of nodes

pair[fwdedge[j]] = i;

pair[i] = fwdedge[j];

unmatch -= 2;

}

}

for (istart=1; istart<=n; istart++)

if ((unmatch >= 2) && (pair[istart] == 0))

{

//istart not yet matched

for (i=1; i<=n; i++)

outree[i] = true;

outree[istart] = false;

//insert root in the queue

queue[1] = istart;

first = 1;

last =1;

nopath = true;

do

{

nodep = queue[first];

first = first + 1;

nodeu = firstedge[nodep];

nodew = firstedge[nodep + 1] - 1;

while (nopath && (nodeu <= nodew))

{

//examine neighbour of nodep

if (outree[fwdedge[nodeu]])

{

height2 = fwdedge[nodeu];

nodeq = pair[height2];

if (nodeq ==0)

{

//an augmentation path is found

pair[height2] = nodep;

do

{

height1 = pair[nodep];

pair[height1] = height2;

if (height1 !=0)

{

nodep = grandparent[nodep];

pair[height1] = nodep;

height2 = height1;

}

}

while (height1 !=0);

unmatch -= 2;

nopath = false;

}

else {

if (nodeq != nodep)

{

if (nodep == istart)

newnode = true;

else {

nodev = grandparent[nodep];

while ((nodev != istart) && (nodev != height2))

nodev = grandparent[nodev];

newnode = (nodev == istart ? true : false);

}

if (newnode)

{

//add a tree link

outree[height2] = false;

grandparent[nodeq] = nodep;

last++;

queue[last] = nodeq;

}

}

}

}

nodeu++;

}

}while (nopath && (first <= last));

}

pair[0] = unmatch;

}

public static void main (String args[]){

int n=14;

int m=14;

int inode[] = {0, 4, 3, 1, 12, 6, 8, 13, 10, 1, 9, 1, 10, 3};

int jnode[] = {0, 2, 13, 5, 7, 11, 1, 9, 12, 6, 3, 11, 2, 1};

int pair[] = new int[n + 1];

cardinalmatch(n,m,inode,jnode,pair);

System.out.println("maximum matching:\n");

for(int i=0; i<=n; i++)

System.out.print(" " + pair[i]);

System.out.println("\n\n no. of unmatched nodes = " + pair[0]);

}

}

7. Conclusion

Since huge amounts of data are present on the web, it is almost essential to learn how to group them using different data mining techniques.After grouping comes the matching,and since this project deals with mainly an index based approach to graph matching it can really be useful,if not on the WWW but on databases(RDF). Web Mining is the future of data mining and making the World Wide Web easier. The hub of the web can be sorted quickly by tracing through the methods of mining. There are some issues that are needed to be kept in check. The organizations need to enhance and develop something huge that can elevate the efficiency of the whole system by debugging the errors in a fraction of a second.