Evaluating the semantic similarity of concepts is a problem that has been extensively investigated in the literature in different areas, such as artificial intelligence, cognitive science, databases and software engineering. Semantic similarity relates to computing the similarity between conceptually similar but not necessarily lexically similar terms. Currently, it is growing in importance in different settings, such as digital libraries, heterogeneous databases and, in particular, the Semantic Web. In such contexts, concepts are very often organized according to a taxonomy (or hierarchy). We investigate approaches to compute the semantic similarity between natural language terms. This paper presents approaches for measuring semantic similarity between words, using a hierarchical structure together with information content. A common data set of word pairs is used and two computational measures are calculated.
Keywords: Sense, Concept, Information content similarity.
I. INTRODUCTION
In this paper, we present an approach for capturing similarity between words that is concerned with the semantic, rather than merely the syntactic, similarity of two strings. Semantic similarity is a confidence score that reflects the semantic relation between the meanings of two sentences. It is difficult to attain a high accuracy score because exact semantic meanings are completely understood only in a particular context. Some dictionary-based algorithms are available to capture the semantic similarity between two words.
Let us consider a very general knowledge base (KB), essentially defined by a set of concepts that are organized according to a generalization (ISA) hierarchy [1], where each concept may be associated with a structure or a feature vector containing the properties describing the concept. A concept has a name, an enumerated set of super-concepts (taxonomic information) and a tuple of typed properties (structural information). For instance, consider the following set of concepts:
person = {name: string, SSN:string}
student = ISA (person) {college: string}
worker = ISA (person) {EIN: string,salary:integer}
machine = {name: string, maker: string}
railcar =ISA (machine) {VIN:string,owner:person}
....
Where SSN, EIN, and VIN stand for Social Security Number, Employer Identification Number and Vehicle Identification Number respectively.
Therefore, a concept has a left hand side, defined by the name of the concept, and a right hand side containing the hierarchical and/or structural information. For instance, in the case of the concept named person, only structural information is present on the right hand side, i.e., two typed properties, namely name and SSN, both of type string. In the case of student, in addition to the structural information (the property college of type string), we also have taxonomic information, expressed by the ISA construct. This means that student has a super-concept, namely person, whose typed properties will be inherited. Inheritance is a well-known mechanism that has been extensively investigated in the literature.
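The inheritance of typed properties described above can be sketched as follows. The representation (a dictionary mapping each concept to its ISA parent and local properties) is ours, not the paper's; the concept and property names follow the example KB.

```python
# Minimal sketch of the example knowledge base: each concept has an
# optional super-concept (ISA) and a set of locally declared typed properties.
KB = {
    "person":  {"isa": None,      "props": {"name": "string", "SSN": "string"}},
    "student": {"isa": "person",  "props": {"college": "string"}},
    "worker":  {"isa": "person",  "props": {"EIN": "string", "salary": "integer"}},
    "machine": {"isa": None,      "props": {"name": "string", "maker": "string"}},
    "railcar": {"isa": "machine", "props": {"VIN": "string", "owner": "person"}},
}

def all_properties(concept):
    """Collect a concept's typed properties, including those inherited
    from its super-concepts by walking up the ISA hierarchy."""
    props = {}
    while concept is not None:
        entry = KB[concept]
        # properties declared lower in the hierarchy take precedence
        props = {**entry["props"], **props}
        concept = entry["isa"]
    return props
```

For instance, `all_properties("student")` yields the local property college plus name and SSN inherited from person.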
A hierarchy [1] given in Figure 1 contains an arc between student and person, since person is a super-concept of student. Furthermore, suppose it also contains arcs among VIN, SSN, EIN and id number (identification number). The root of the concept hierarchy is labeled by Top which represents the most general concept. Dotted lines stand for paths of arbitrary length.
WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, automatic text classification, automatic text summarization, and even automatic crossword puzzle generation. Another prominent example of the use of WordNet is to determine the similarity between words. Various algorithms have been proposed, and these include considering the distance between the conceptual categories of words, as well as considering the hierarchical structure of the WordNet ontology. A number of these WordNet-based word similarity algorithms are implemented.
The paper is organized as follows. Section II reviews the WordNet taxonomy, database, and word senses. Section III briefly describes semantic similarity measurement using information content. Section IV describes the results, and Section V concludes.
II. WORDNET TAXONOMY
WordNet is a lexical database for the English language [1]. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The specific meaning of one word under one type of POS is called a sense. Each synset has a gloss that defines the concept it represents. For example, the words night, nighttime, and dark constitute a single synset that has the following gloss: the time after sunset and before sunrise while it is dark outside. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database can also be browsed online. WordNet [6] was created and is being maintained at the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George A. Miller. Development began in 1985. WordNet's latest version is 3.0. Most synsets are connected to other synsets via a number of semantic relations. These relations vary based on the type of word.
Synsets are connected to one another through explicit semantic relations. Some of these relations (hypernymy and hyponymy for nouns, hypernymy and troponymy for verbs) constitute is-a-kind-of hierarchies, while holonymy and meronymy (for nouns) constitute is-a-part-of hierarchies. For example, tree is a kind of plant: tree is a hyponym of plant, and plant is a hypernym of tree. Analogously, trunk is a part of a tree, so trunk is a meronym of tree. While semantic relations apply to all members of a synset, because the members share a meaning and are mutually synonymous, words can also be connected to other words through lexical relations, including antonymy (words that are opposites of each other) and derivational relatedness. WordNet also provides the polysemy count of a word: the number of synsets that contain the word. If a word participates in several synsets (i.e., has several senses), then typically some senses are much more common than others.
A. WordNet Database
For each syntactic category, two files represent the WordNet[6] database - index.pos and data.pos, where pos is either noun, verb, adj or adv. The database is in an ASCII format that is human- and machine-readable, and is easily accessible to those who wish to use it with their own applications. The index and data files are interrelated. The WordNet morphological processing function, morphy(), handles a wide range of morphological transformations.
During WordNet development synsets are organized into forty-five lexicographer files based on syntactic category and logical groupings. grind() processes these files and produces a database suitable for use with the WordNet library, interface code, and other applications. A file number corresponds to each lexicographer file. File numbers are encoded in several parts of the WordNet system as an efficient way to indicate a lexicographer file name. The file lexnames lists the mapping between file names and numbers, and can be used by programs or end users to correlate the two.
The syntactic categories in WordNet are noun, verb, adjective and adverb. Each lexicographer file consists of a list of synonym sets (synsets) for one part of speech. Although the basic synset syntax is the same for all of the parts of speech, some parts of the syntax only apply to a particular part of speech. Each filename specified is of the form:
pathname/pos.suffix
where pathname is optional and pos is either noun, verb, adj or adv. suffix may be used to separate groups of synsets into different files, for example noun.animal and noun.plant. One or more input files, in any combination of syntactic categories, may be specified from the list of lexicographer files used to build the complete WordNet database; grind() then produces the database files (index.pos and data.pos).
The WordNet sense index provides an alternate method for accessing synsets and word senses in the WordNet database. It is useful to applications that retrieve synsets or other information related to a specific sense in WordNet, rather than all the senses of a word or collocation. It can also be used with tools like grep and Perl to find all senses of a word in one or more parts of speech. A specific WordNet sense, encoded as a sense_key, can be used as an index into this file to obtain its WordNet sense number, the database byte offset of the synset containing the sense, and the number of times it has been tagged in the semantic concordance texts.
A sense_key is the best way to represent a sense in semantic tagging or other systems that refer to WordNet senses. sense_keys are independent of WordNet sense numbers and synset_offsets, which vary between versions of the database. Using the sense index and a sense_key, the corresponding synset (via the synset_offset) and WordNet sense number can easily be obtained.
The sense index file lists all of the senses in the WordNet database, with each line representing one sense. The file is in alphabetical order, fields are separated by one space, and each line is terminated with a newline character. Each line is of the form:
sense_key synset_offset sense_number tag_cnt
sense_key is an encoding of the word sense. Programs can construct a sense key in this format and use it as a binary search key into the sense index file. synset_offset is the byte offset at which the synset containing the sense is found in the database "data" file corresponding to the part of speech encoded in the sense_key. synset_offset is an 8-digit, zero-filled decimal integer, and can be used with fseek to read a synset from the data file.
sense_number is a decimal integer indicating the sense number of the word, within the part of speech encoded in sense_key, in the WordNet database. tag_cnt represents the decimal number of times the sense is tagged in various semantic concordance texts. A tag_cnt of 0 indicates that the sense has not been semantically tagged. All of the WordNet noun synsets are organized into hierarchies, headed by the unique beginner synset for entity in the file noun.Tops.
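The four-field line format above can be parsed straightforwardly. The sketch below is ours, and the sample line is illustrative rather than copied from a real index file (the sense key, offset, and tag count follow the paper's person example but are not guaranteed to match any particular WordNet release).

```python
# Hedged sketch: parse one line of the WordNet sense index file, whose
# space-separated fields are sense_key, synset_offset, sense_number, tag_cnt.
def parse_sense_index_line(line):
    sense_key, synset_offset, sense_number, tag_cnt = line.strip().split(" ")
    return {
        "sense_key": sense_key,
        "synset_offset": int(synset_offset),  # 8-digit, zero-filled in the file
        "sense_number": int(sense_number),
        "tag_cnt": int(tag_cnt),              # 0 means never semantically tagged
    }

# illustrative line, not taken verbatim from a distributed index file
entry = parse_sense_index_line("person%1:03:00:: 00007846 1 6833")
```

An entry's synset_offset can then be used to seek directly into the corresponding data file.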
B. WordNet as an Ontology
The hypernym/hyponym relationships among the noun synsets can be interpreted as specialization relations between conceptual categories. In other words, WordNet can be interpreted and used as a lexical ontology [2] in the computer science sense. The WordNet dictionary contains the senses of words. The frequency of a particular sense is given in parentheses, and "(n)" indicates a noun.
According to WordNet dictionary [6], the word "person" has three senses:
sense 1: (6833)S: (n) person, individual, someone, somebody, mortal, soul (a human being) "there was too much for one person to do"
sense 2:(1)S: (n) person (a human body (usually including the clothing)) "a weapon was hidden on his person"
sense 3 :S: (n) person (a grammatical category used in the classification of pronouns, possessive determiners, and verb forms according to whether they indicate the speaker, the addressee, or a third party) "stop talking about yourself in the third person"
The word "student" has two senses:
sense 1: (67)S: (n) student, pupil, educatee (a learner who is enrolled in an educational institution)
sense 2: (14)S: (n) scholar, scholarly person, bookman, student (a learned person (especially in the humanities); someone who by long study has gained mastery in one or more disciplines)
student, pupil, and educatee are called synonyms of sense 1 of the word "student". The word "worker" has four senses:
sense 1: (29)S: (n) worker (a person who works at a specific occupation) "he is a good worker"
sense 2: (4)S: (n) proletarian, prole, worker (a member of the working class (not necessarily employed)) "workers of the world--unite!"
sense 3:(4)S: (n) worker (sterile member of a colony of social insects that forages for food and cares for the larvae)
sense 4:S: (n) actor, doer, worker (a person who acts and gets things done) "he's a principal actor in this affair"; "when you want something done get a doer"; "he's a miracle worker"
The word "machine" has six senses:
sense 1:(33)S: (n) machine (any mechanical or electrical device that transmits or modifies energy to perform or assist in the performance of human tasks)
sense 2:(2)S: (n) machine (an efficient person) "the boxer was a magnificent fighting machine"
sense 3:(2)S: (n) machine (an intricate organization that accomplishes its goals efficiently) "the war machine"
sense 4:(1)S: (n) machine, simple machine (a device for overcoming resistance at one point by applying force at some other point)
sense 5:S: (n) machine, political machine (a group that controls the activities of a political party) "he was endorsed by the Democratic machine"
sense 6:S: (n) car, auto, automobile, machine, motorcar (a motor vehicle with four wheels; usually propelled by an internal combustion engine) "he needs a car to get to work"
The word "railcar" has one sense:
sense 1:S: (n) car, railcar, railway car, railroad car (a wheeled vehicle adapted to the rails of railroad) "three cars had jumped the rails"
The word "identification number" has one sense:
sense 1:S: (n) number, identification number (a numeral or string of numerals that is used for identification) "she refused to give them her Social Security number"
The word "salary" has one sense:
sense 1: (10)S: (n) wage, pay, earnings, remuneration, salary (something that remunerates) "wages were paid by check"; "he wasted his pay on drink"; "they saved a quarter of all their earnings"
The word "college" has three senses:
sense 1:(45)S: (n) college (the body of faculty and students of a college)
sense 2:S: (n) college (an institution of higher education created to educate and grant degrees; often a part of a university)
sense 3:S: (n) college (a complex of buildings in which an institution of higher education is housed)
The word "interest" has seven senses:
sense 1: (62)S: (n) interest, involvement (a sense of concern with and curiosity about someone or something) "an interest in music"
sense 2: (32)S: (n) sake, interest (a reason for wanting something done) "for your sake"; "died for the sake of his country"; "in the interest of safety"; "in the common interest"
sense 3: (21)S: (n) interest, interestingness (the power of attracting or holding one's attention (because it is unusual or exciting etc.)) "they said nothing of great interest"; "primary colors can add interest to a room"
sense 4: (14)S: (n) interest (a fixed charge for borrowing money; usually a percentage of the amount borrowed) "how much interest do you pay on your mortgage?"
sense 5: (7)S: (n) interest, stake ((law) a right or legal share of something; a financial involvement with something) "they have interests all over the world"; "a stake in the company's future"
sense 6: (5)S: (n) interest, interest group ((usually plural) a social group whose members control some field of activity and who have common aims) "the iron interests stepped up production"
sense 7: (3)S: (n) pastime, interest, pursuit (a diversion that occupies one's time and thoughts (usually pleasantly)) "sailing is her favorite pastime"; "his main pastime is gambling"; "he counts reading among his interests"; "they criticized the boy for his limited pursuits"
The word "subject" has eight senses:
sense 1:(20)S: (n) subject, topic, theme (the subject matter of a conversation or discussion) "he didn't want to discuss that subject"; "it was a very sensitive topic"; "his letters were always on the theme of love"
sense 2: (14)S: (n) subject, content, depicted object (something (a person or object or scene) selected by an artist or photographer for graphic representation) "a moving picture of a train is more dramatic than a still picture of the same subject"
sense 3: (11)S: (n) discipline, subject, subject area, subject field, field, field of study, study, bailiwick (a branch of knowledge) "in what discipline is his doctorate?"; "teachers should be well trained in their subject"; "anthropology is the study of human beings"
sense 4: (9)S: (n) topic, subject, issue, matter (some situation or event that is thought about) "he kept drifting off the topic"; "he had been thinking about the subject for several years"; "it is a matter for the police"
sense 5: (4)S: (n) subject ((grammar) one of the two main constituents of a sentence; the grammatical constituent about which something is predicated)
sense 6: (2)S: (n) subject, case, guinea pig (a person who is subjected to experimental or other observational procedures; someone who is an object of investigation) "the subjects for this investigation were selected randomly"; "the cases that we studied were drawn from two different communities"
sense 7: (2)S: (n) national, subject (a person who owes allegiance to that nation) "a monarch has a duty to his subjects"
sense 8:S: (n) subject ((logic) the first term of a proposition)
Sense 1 of the word "interest" and sense 3 of the word "subject" are semantically similar. To measure the semantic similarity between two words, we use hyponym/hypernym (is-a) relations. Owing to the limitations of the is-a hierarchy, we only work with noun-noun pairs. A simple way to measure the semantic similarity between two synsets is to treat the taxonomy as an undirected graph and measure the distance between them in WordNet. The length of the path between two members of the same synset is 1 (the synonym relation).
C. Similarity Measurement Using Path Length
Semantic similarity can be measured simply by counting the nodes or edges on the path between the concepts. Resnik (1995) observed that "the shorter the path from one node to another, the more similar they are". Figure 2 shows an example of the hyponym taxonomy in WordNet used for path-length similarity measurement.
Figure 2. Taxonomy in WordNet
In the above figure, we observe that the length between car and auto is 1, between car and truck is 3, between car and bicycle is 4, and between car and fork is 12. A shared parent of two synsets is known as a subsumer. The least common subsumer (LCS) of two synsets is the subsumer that does not have any children that are also subsumers of the two synsets. In other words, the LCS of two synsets is the most specific subsumer of the two synsets. Returning to the above example, the LCS of {car, auto, ...} and {truck, ...} is {motor vehicle, automotive vehicle}, since it is more specific than the common subsumer {wheeled vehicle}.
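Path length and the LCS can be sketched on a small taxonomy fragment. The fragment below is ours, not the exact Figure 2 tree, and this sketch counts edges, so its numbers may differ by one from the node-based counts quoted above.

```python
from collections import deque

# Toy is-a fragment (ours, not the exact Figure 2 taxonomy); child -> parent.
PARENT = {
    "car": "motor vehicle",
    "truck": "motor vehicle",
    "motor vehicle": "wheeled vehicle",
    "bicycle": "wheeled vehicle",
    "wheeled vehicle": "vehicle",
}

def path_length(a, b):
    """Edges on the shortest path between two synsets, treating the
    taxonomy as an undirected graph (breadth-first search)."""
    graph = {}
    for child, parent in PARENT.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: concepts in disconnected fragments

def lcs(a, b):
    """Least common subsumer: the most specific shared ancestor."""
    def chain(c):
        out = [c]
        while c in PARENT:
            c = PARENT[c]
            out.append(c)
        return out
    on_b_path = set(chain(b))
    for c in chain(a):  # walk upward from a; the first shared node is the LCS
        if c in on_b_path:
            return c
    return None
```

Here `lcs("car", "truck")` is "motor vehicle", which is more specific than the shared ancestor "wheeled vehicle", matching the LCS definition above.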
III. SEMANTIC SIMILARITY USING INFORMATION CONTENT
WordNet connects concepts or senses, but most words have more than one sense. Word similarity can therefore be determined by the best conceptual similarity value among all the concept (sense) pairs. It can be defined as follows:

sim(w1, w2) = max { sim(c1, c2) : c1 ∈ sen(w1), c2 ∈ sen(w2) }

where sen(w) denotes the set of possible senses for word w.
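Maximizing over sense pairs can be sketched as follows. The sense inventory and the concept-level scores below are invented for illustration; any concept-level measure (path, Lin, Jiang-Conrath) could stand in for the lookup table.

```python
# Toy sense inventory: sen(w), the set of possible senses for word w.
SENSES = {
    "interest": ["interest.n.01", "interest.n.03"],
    "subject": ["subject.n.03", "subject.n.05"],
}

# Invented concept-level similarity scores, standing in for any
# concept measure such as path length, Lin, or Jiang-Conrath.
CONCEPT_SIM = {
    ("interest.n.01", "subject.n.03"): 0.35,
    ("interest.n.01", "subject.n.05"): 0.10,
    ("interest.n.03", "subject.n.03"): 0.62,
    ("interest.n.03", "subject.n.05"): 0.05,
}

def word_similarity(w1, w2):
    """sim(w1, w2) = max over all sense pairs (c1, c2) of sim(c1, c2)."""
    return max(CONCEPT_SIM[(c1, c2)]
               for c1 in SENSES[w1] for c2 in SENSES[w2])
```

The word-level score is thus driven by the single best-matching sense pair, here the "subject matter" senses of interest and subject.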
Traditionally, in order to evaluate the semantic similarity of hierarchically related concepts, the information content approach is adopted. It is based on associating probabilities with the concepts of the hierarchy. In particular, the probability of a concept c is defined as

p(c) = freq(c) / M

where freq(c) is the frequency of the concept c estimated using noun frequencies from large text corpora [3] and M is the total number of observed instances of nouns in the corpus. In this example, probabilities have been assigned according to the SemCor project, which links subsections of the Brown Corpus to senses in the WordNet lexicon. The information content of a concept c is defined as
IC(c) = -log p(c)
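The two definitions above can be sketched directly. The frequency counts below are invented; in practice, freq(c) aggregates the counts of a concept and everything it subsumes, so the root has probability 1 and information content 0, while more specific concepts carry more information.

```python
import math

# Invented corpus counts; freq(c) includes counts of subsumed concepts,
# so the root ("entity") accounts for all M observed noun instances.
FREQ = {"entity": 1000, "vehicle": 120, "car": 40}
M = 1000  # total number of observed noun instances in the toy corpus

def p(concept):
    """p(c) = freq(c) / M"""
    return FREQ[concept] / M

def ic(concept):
    """IC(c) = -log p(c); rarer (more specific) concepts are more informative."""
    return -math.log(p(concept))
```

As expected, `ic("entity")` is 0 and information content grows monotonically as concepts become more specific.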
1) Lin: Lin [4] takes an information content approach to computing the semantic similarity between two words, defining the information content similarity (sim) of two concepts c1 and c2 as

sim(c1, c2) = 2 * IC(c) / (IC(c1) + IC(c2))

where c is the concept providing the maximum information content shared by c1 and c2 in the taxonomy, i.e., the more information two concepts share, the more similar they are. Note that c is the upper bound of c1 and c2 in the taxonomy whose information content is maximum, i.e., when defined, the least upper bound.
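Lin's ratio can be sketched with invented probabilities; the subsumer is passed in explicitly here rather than found in a taxonomy, and all numbers are ours.

```python
import math

# Invented concept probabilities for illustration.
P = {"vehicle": 0.12, "car": 0.04, "truck": 0.02}

def ic(c):
    return -math.log(P[c])

def lin_similarity(c1, c2, subsumer):
    """sim(c1, c2) = 2 * IC(c) / (IC(c1) + IC(c2)),
    where `subsumer` is the most informative concept subsuming both."""
    return 2 * ic(subsumer) / (ic(c1) + ic(c2))

score = lin_similarity("car", "truck", "vehicle")
```

The measure lies in (0, 1] for concepts with positive information content, and a concept compared with itself scores exactly 1.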
2) Jiang-Conrath: This approach [3] takes into account both the concepts and their common ancestor when calculating similarity. The Jiang-Conrath measure gives a semantic distance rather than a similarity or relatedness score:
Dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(c)
Where c is the concept providing the maximum information content shared by c1 and c2 in the taxonomy. This distance measure can be converted to a similarity measure by taking the multiplicative inverse of it:
sim(c1, c2) = 1 / Dist(c1, c2)
Thus sim(c1,c2) gives the similarity between concept c1 and concept c2.
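The distance and its inverse can be sketched with the same kind of invented probabilities as before; as above, the subsumer is supplied explicitly and all numbers are ours.

```python
import math

# Invented concept probabilities for illustration.
P = {"vehicle": 0.12, "car": 0.04, "truck": 0.02}

def ic(c):
    return -math.log(P[c])

def jc_distance(c1, c2, subsumer):
    """Dist(c1, c2) = IC(c1) + IC(c2) - 2 * IC(c)"""
    return ic(c1) + ic(c2) - 2 * ic(subsumer)

def jc_similarity(c1, c2, subsumer):
    """sim(c1, c2) = 1 / Dist(c1, c2); undefined (division by zero)
    when the distance is 0, i.e., for identical concepts."""
    return 1 / jc_distance(c1, c2, subsumer)
```

Note the edge case: identical concepts have distance 0, so implementations typically special-case that pair rather than invert a zero distance.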
IV. EVALUATION
In this section, we show the results of the two methods listed in Section III. We set up an experiment to compute the semantic similarity between the words of a data set. Table II lists the results of each similarity measure for the pairs of words [3][4][7] using the information content approaches, which are used to evaluate the semantic similarity of hierarchically organized concepts.
V. CONCLUSION AND FUTURE WORK
In this paper, we presented a concept similarity matching method based on information content, using the hierarchy of WordNet. The results give the similarity measures of word pairs. This semantic similarity can be used to expand a query with a set of synonyms selected by similarity score, which can indeed enhance the information retrieval (IR) task, since users frequently fail to describe the information they want to retrieve in the search query.
In future work, we will extend the semantic matching approach by computing semantic similarity across different ontologies. The approaches presented here can be further enhanced by incorporating Word Sense Disambiguation (WSD). With the computed similarity, WSD can be performed in the similarity computation module by maximizing relatedness, generating the concepts required by the query expansion module.