Extensible Markup Language is a simple text format which was designed to describe data using custom tags defined in Document Type Definitions (DTD) or an XML Schema. The use of custom tags made XML extremely flexible and enables it to not only describe structured data like information from a table in a relational database but also semi-structured data. An XML document along with its DTD or Schema is also self-describing which has made it a standard means of data exchange between applications and for use in configuration files of enterprise applications. XML is widely used in many web and database applications. The increasing preference to store and transmit data in the XML format has led to a need for searching these data stores for information. Query languages like Xpath and XQuery are used to retrieve information from xml document. But these query languages are complex for non expert user to learn and retrieve information from xml document. Keyword search allows such user to retrieve information without knowledge of complex query language. The Keyword search algorithm must support both specific and non specific search by considering user's search intension, keyword ambiguity and relevance ranking of result. In this paper we propose an algorithm that allows user to retrieve information from multiple xml documents by considering user's search intension , keyword ambiguity and relevance ranking of result, also it allows user to retrieve specific records by making use of predicates .
Keywords: Natural processing, XML, Keyword Search, Relevance Ranking.
1 INTRODUCTION
Language (XML) lets developers simply and flexibly represent and exchange eXtensible Markup data over the Web has dramatically increased XML document proliferation. This, in turn, has increased the demand for ad hoc techniques for XML query processing. As like in any data management application, XML-based systems can achieve effective and efficient query execution by providing a sufficiently expressive query language, such as XQuery [1] and XPath [2] for XML; but these query languages are very complex to understand/learn to the non expert user. It is desirable to support keyword search in XML database, to retrieve all records having almost all query keyword. It is nothing but finding the smallest sub-structures in XML data that each contains almost all query keywords in either the tree data model or the directed graph (i.e. digraph) data model. It is a user friendly way to query XML databases since it allows users to pose queries without the knowledge of complex query languages and the database schema. XML keyword search generates hundreds of results (nodes) which may have different percentage of desirability. It is required to rank these results so that the results will be displayed in relevance ranking order to a user query, just like the popular web engine that accepts the keyword query and the desirable text documents are displayed according to their relevance to the keyword query.
2 EFFECTIVE XML KEYWORD SEARCH
Non specific search query includes only keywords which generate hundreds of result. Effectiveness in term of result ranking is the most crucial part in keyword search. Along with this it is required to understand the intension of user correctly and consider keyword ambiguity problem. User's search intension can be determined by indentifying exactly what type of node user want to search; these nodes are called as search for nodes and which node have high confidentiality as search via nodes .
Another problem with XML keyword queries is that it usually has ambiguities in interpreting the search for node(s) and search via node(s), due to three reasons below [4].
1. A keyword can appear both as an XML tag name and as a text value of some other nodes.
2. A keyword can appear as the text values of different types of XML nodes and carry different
meanings.
3. A keyword can appear as an XML tag name in different contexts and carry different meanings.For example in fig. 1, "street" is present as tag node as well as a data node.
Along with non specific search it is desirable to support specific search. For example suppose an user wants to buy a car on the Internet might not know how exactly the price and category of a car are represented at the dealer's Web site; rather than looking at the DTD, the user would prefer to directly ask for all cars with name Toyota price < 500000; this query involves a keyword search for price and the evaluation of a predicate on the value of price along with the name of car. Such keyword query must select the nodes of xml document having car name Toyota and price < 500000, in this paper we proposed the efficient algorithm that supports specific and nonspecific keyword search having following features.
It allows search on xml repository consists of multiple documents.
It support specific search by making use of relational operators.
Relevance ranking of the result set based on statistics of data in document.
Considers user search intension and keyword & tag ambiguity problem [4].
DTD independent.
3 RELATED WORK
Extensive research efforts have been conducted in XML keyword search to find the smallest sub-structures in XML data that each contains all query keywords in either the tree data model or the directed graph (i.e. digraph) data model. Various methods like SLCA, LCA [5] that are used for the same but none of them has addressed and resolved the three issues mentioned in section 2. For instance, one widely adopted approach so far is to find the smallest lowest common ancestor (SLCA) of all keywords .Each SLCA result of a keyword query contains all query keywords but has no sub tree which also contains all the keywords. Consider a keyword query "customer interest art" issued on the bookstore data in Figure 1, and most likely it intends to find the customers who are interested in art. If adopting SLCA, 5 results will be retrieved which include the title of book B1 and the customer nodes with IDs from C1 to C4 (as these four customer nodes contain "customer", "interest" and "art" in either the tag names or node values) in Figure 1. Since SLCA cannot well address the search intention, all these 5 results are returned without any ranking applied. However, only C4 is desired which should be put as the top ranked one, and C2 is less relevant, as his interest is "street art" rather than "art", while C1 and C3 are irrelevant.
XSeek [6] generates the return nodes which can be explicitly inferred by keyword match pattern and the concept of entities in XML data. However, it addresses neither the ranking problem nor the keyword ambiguity problem. Besides, it relies on the concept of entity (i.e. object class) and considers a node type t in DTD as an entity if t is "*"-annotated in DTD. As a result, customer, phone, interest, book in Figure 1, are identified as object classes by XSeek. However, it causes the multi-valued attribute to be mistakenly identified as an entity, causing the inferred return node not as intuitive as possible. E.g. phone and interest are not intuitive as entities. In fact, the identification of entity is highly dependent on the semantics of underlying database rather than its DTD, so it usually requires the verification and decision from database administrator.
Figure 1. Portion of data tree for an online bookstore XML database [4]
4 RELEVANCE RANKING
4.1 TF - IDF
The TF-IDF weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a flat document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. TF*IDF similarity is given by Eq.1
Eq. 1
where q represents a query, d represents a flat document and k is a keyword appearing in both q and d. A larger value of (q, d) indicates q and d are more relevant to each other. Wq,k and Wd,k represent the weights of k in query q and document d respectively, while Wq and Wd are the weights of query q and document d. The formulas of Wq,k , Wd,k ,Wq and Wd are given below.
Eq. 2
Eq. 3
Eq. 4
Eq. 5
Text documents are flat documents, relevance ranking score is calculated using TF/IDF formula. But this formula can't be applied to the XML document since it is semi structured document.
4.2 XML TF*IDF SIMILARITY
XML term Frequency fa,k : The number of occurrences of a keyword k in a given data node a in XML data . It is analogous to term frequency fd,k (in Formula 3) .XML Document Frequency fTk : The number of T-typed nodes that contain keyword k in their sub trees in XML data. It is analogous to document frequency fk (in Formula 2).Following formulas are used to calculate XML tf*idf similarity [4].
(a) Where a is value node
(b) Where s is internal node
Eq.6
Eq.7
Eq.8
1+ Eq.9
Eq.10
Eq. 11
In Formula 6, q represents a keyword query, a represents an XML node Sim (q, a) represents the similarity value between q and a. In the case a is leaf node, the similarity values between XML leaf nodes and a given query is calculated using 6.a , in a similar way to the original TF*IDF, since leaf nodes contain only keywords with no further structure. In the case of a is an internal node formula 6.b is used to recursively calculate the similarity between a and query. represents the weight of keyword k in q with respect to node type Ta. is a normalization factor in the recursive case of XML TF*IDF similarity formula. is the total number of nodes of type Ta while k is the number of Ta-typed nodes containing keyword k. represents the confidence of a data node v as the node to search via with respect to a keyword k appearing in both query q and v.
is used to capture the keyword co-occurrence and is maximum of In-Query Distance(IQD) and Structural Distance (SD). The In-Query Distance(IQD) is the distance between keyword k and node type kt in a query q is defined as the absolute value of the position distance between kt and k in q; otherwise, Distq(q, k,,k)=∞. Structural Distance (SD) is the distance between kt and k with respect to a data node v is defined as the depth distance between v and the nearest kt-typed ancestor node of v in XML data. The node type having highest value of Sfor is considered as the required search for node and the node type have highest value of Svia is considered as required search via node. Sfor and Svia can be determined by Eq. 11and 10. Product used in Eq. 11 make sure that T type of node is related to all keywords in query, if it is not associated each keyword then the value of Sfor will be zero. r is reduction factor having value in range 0 to 1. Depth (T) represents the depth of T-typed nodes in document.
5 DATA MODEL
• Root node: There is one root node for every XML document. A root node does not occur except as the root of the tree. The element node for the document element is a child of the root node. The root node also has as children processing instruction and comment nodes for processing instructions. For example in figure 1 StoreDB is the root node.
• Structure/Tag nodes: Structure nodes are the mark-up in XML documents. They are defined by a character sequence (the element name) enclosed within opening and closing tags (<tagName..>) and can optionally have a set of attributes. The closing tag of elements has only the element name (</tagName>). Structure nodes can also contain a character data between the start and end tags. In the tree view of the document, there is structure node for every element in the document. Structure nodes can have other structure nodes (represents nested structure) and text nodes as children. Here it is assumed that the structure node is of single keyword. In figure 1 customer, customer, id, name, interest are the examples of structure nodes.
• Data/Leaf/Value nodes: Character data occurring between the open and closing element tags is grouped into value nodes. Each value nodes is considered to be a descendant of the structure node. In figure 1 Mary Smith, art, Art Street are the examples of value nodes.
• Node Type: The type of a value/structure node n in an XML document is the prefix path from root to n. Two nodes are of the same node type if they share the same prefix path. In Figure 1, the name of publisher and the name of customer are of different node types, which are storeDB/books/book/publisher/name and storeDB/customers/customer/name respectively.
In addition to the above, the data model retained for XML and most query languages [7] defines namespace nodes, processing instruction nodes, and comment nodes, Attributes and also link nodes, which make it possible to refer to other elements without explicitly including them. These nodes are ignored.
6 DESIGN
Figure 2 shows architecture of a search service with the XML document retrieval system in its central part. The system consists of the Indexer and the Query engine. The search repository is a set of XML documents. Each document can have different schema. Indexer creates main four indices to capture information of each node in each document while parsing the document. Each document parsed only once using sax parser. A virtual node is created that act as the root of all documents. Indices created during parsing are - Node table, Keyword table , Nodetype table and Frequency table. Node index consists of dewey id[3],node type, keyword info belongs to a node and flag to indicate whether a node is leaf or internal node. Keyword table contains a node id that having the corresponding keyword. Nodetype table contains a node id that having the corresponding nodetype in node table. Frequency table stores the frequency fTk for each combination of keyword k and node type T in XML document. Unique ids for keyword and nodeype are generated using global hash table.
Figure 2. System Architecture
7 ALGORITHM
In Algorithm Key_Search, step 1-3 determines an intension of user by selecting the corresponding nodetype of node which have high value of Sfor.Step 4 and 5 selects all internal node of selected node type and leaf nodes using keyword table and nodetype table. Step 6-10 calculated similarity of internal node a if it is ancestor of any leaf node selected in step 5. getsim algorithm recursively calculates similarity of node a with a query q. Hash tables are used in order to avoid searching.
Key_Search(q)
1 For each document d in repository do
2 Calculate Sor of each node type for each keyword q d
3 Choose all the nodetype with nonzero value of Sfor.
4 Select all internal nodes N for chosen nodetype in step 3.
5 Select the leaf nodes L
Case 1 : q with no predicates : Select the leaf nodes L having at least one keyword in query .
Case 2 : q with predicates : Select the leaf nodes L that satisfying given predicates.
6 for each i=1 to N
7 For each j=1 to L
8 if(isAncestor(i,j))
9 T[k++]=i;
10 for each i=1 to k
Calculate Similarity S[i]=getsim(q,i)
11 Sort T in descending order of S
12 Return T
getsim(q,a)
1 if (isLeaf(a)) then
2 for each k q a
3{Svia(q,a,k)=getQueryKWCo-occur(q,a,k)
4}
5 else //internal node
6 =getQWeight(a,q)
7 for each c child(a) do
8 {Tc=getNodeType(c)
9
10 Sum+=getsim(c,q)*}
11 Return Sum/
In Algorithm Key_Search, step 1-3 determines an intension of user by selecting the corresponding nodetype of node which have non zero value of Sfor. Step 4 and 5 selects all internal node of selected node type and leaf nodes using keyword table and nodetype table. Step 6-10 calculated similarity of internal node a if it is ancestor of any leaf node selected in step 5. getsim algorithm recursively calculates similarity of node a with a query q.
Figure 3. Precision and Recall Measure
Data
Data Size
Index Time
WSU
678 KB
2.146 seconds
Book
143 KB
1.33 seconds
ebay
32 KB
0.234 seconds
Table 1. Indexing Time
8 EXPERIMENTS
To verify the effectiveness of our proposed algorithms, we conducted extensive experiments on the dataset WSU and ebay from [8]. Algorithms are implemented in java. All the experiments were conducted on a 1.87GHz dual-core desktop PC with 1GB of RAM and are shown in table 1 and figure 3. Table 1 indicates an indexing time required for individual file and its size [8]. Figure 3 indicated two metrics in percentage: Precision and Recall that are used for retrieval performance evaluation [9] for following queries on WSU.
230
Biology
place , bldg = TODD
course , limit <= 10
days , TU , TH
9 CONCLUSION
Proposed algorithm for keyword search on xml repository is effective and works on multiple documents. Like XReal [4] it includes the identification of user search intention and result ranking in the presence of keyword ambiguities and returns the results according to relevance ranking to a query. In addition to this it allows user to make specific search by supporting predicates along with keywords.