Research proposal selection is a recurring activity in government research funding agencies. It is a multi-process task that begins with a call for proposals (CFP) issued by a funding agency. As Fig. 1 shows, the CFP is distributed to communities such as universities and research institutions. The submitted proposals are then assigned to experts for peer review. The review results are collected, and the proposals are ranked based on an aggregation of the experts' review results.
Fig. 1: Research proposal selection process
A Plagiarism Detection approach is therefore proposed to address information overload in government agencies. The approach is based on a two-tier client-server architecture. As Fig. 2 shows, there are two kinds of users: server and client. In the proposed method, the division manager acts as the server and holds the database of existing research proposals. Referring to this database, the system creates an ontology, which is a knowledge repository. The division manager then uses the created ontology to perform clustering, and the system assigns the resulting clusters to the respective peer reviewers. The peer reviewers act as clients and are experts in their respective disciplines. Once the system has created the clusters, each peer reviewer receives the cluster of proposals grouped under his or her domain, which serves as the discipline. When an end-user submits a research proposal in response to a funding agency's call for proposals (CFP) [5], the system determines the discipline of the submitted proposal. The system then forwards the proposal to the respective peer reviewer so that its similarity to the existing research proposals in the corresponding cluster can be checked.
Fig. 2: Proposed Method
Thus, the proposed method aims to computerise the manual process of clustering research proposals. The system helps to find the accurate discipline of a research proposal. It allows the reviewer to check the end-user's submitted research proposal for ambiguity. The Plagiarism Detection approach helps the reviewer to find relevant existing research proposals that are similar to the end-user's submitted proposal.
LITERATURE REVIEW
Research proposal selection is a recurrent activity, and many formal methods exist to support it. W. M. Wang and C. F. Cheung [6] proposed a semantic-based intellectual property management system, an automated system for assisting inventors in patent analysis. In government funding agencies, patent databases provide valuable information for technology management. However, the rapid growth of patent documents, their lengthy text and rich technical terminology, and the complicated relationships among patents mean that conducting analyses takes a lot of human effort. As a result, an automated system that assists inventors in patent analysis and supports technological innovation is in great demand. The Semantic-based Intellectual Property Management System (SIPMS) [6] was developed to support the management of intellectual property (IP). It incorporates semantic analysis and text-mining techniques for processing and analyzing patent documents. The method differs from traditional technology management tools in its knowledge base: instead of eliciting knowledge from domain experts, it adopts global patent databases as sources of knowledge. The system enables users to search for existing patent documents or relevant IP documents related to a potential new invention, and it supports invention by providing the relationships and patterns among a group of IP documents. However, this method proposed a hybrid knowledge-based approach for assigning the clustered research proposals to reviewers.
Current methods group proposals according to keywords. As a result, research proposals with similar research areas might be placed in the wrong groups, for the following reasons. First, keywords give incomplete information about the full content of the proposals. Second, keywords are provided by applicants, who may have subjective views and misconceptions, and keywords are only a partial representation of the research proposals. Third, manual grouping is usually conducted by division managers in funding agencies, who may understand the research disciplines differently and may not have adequate knowledge to assign proposals to the right groups. Text-mining methods (TMMs) [4] have been designed to group proposals based on an understanding of the English text.
The proposed approach uses text-mining [3] and multilingual ontology techniques to cluster research proposals based on their similarities.
III. PLAGIARISM DETECTION OF RESEARCH PROPOSALS
An ontology is a knowledge repository in which concepts and terms are defined, together with the relationships between these concepts. It consists of a set of concepts, axioms, and relationships, as shown in Fig. 9, that describe a domain of interest and represent an agreed-upon conceptualization of the domain's "real-world" setting. Knowledge that is implicit for humans is made explicit for computers. The proposed Plagiarism Detection of Research Proposals (PDRP) method consists of three phases:
(Phase 1). First, a research ontology containing the existing proposals is constructed according to keywords, and it is updated with every new proposal.
(Phase 2). Then, new research proposals are classified according to discipline areas using a sorting algorithm.
(Phase 3). Next, with reference to the ontology, the new proposals in each discipline are clustered using a self-organized mapping (SOM) algorithm.
Fig. 3: System Architecture
Before these phases, there is a pre-processing phase in which the ontology is constructed by the system. In the initial state, shown in Fig. 4, the division manager holds randomly scattered proposals.
Fig. 4: Pre-processing-Initial state
Then, the system groups the research proposals, as shown in Fig. 5, according to their disciplines and forms respective clusters.
Fig. 5: Pre-processing-Cluster by Domain
After forming the clusters, the system assigns them to the respective peer reviewers, as shown in Fig. 6, who act as experts in their respective disciplines.
Fig. 6: Pre-processing-Assign to Reviewers
When an input proposal arrives, the system finds the exact discipline of the research proposal by referring to the constructed ontology. It then assigns the input proposal to the matching cluster, as shown in Fig. 7. The peer reviewer can now check the similarity of the input proposal submitted by the end-user against the existing research proposals of the respective discipline. Finally, the system outputs the Best Matching Unit (BMU) and provides the five best-matching proposals for the input research proposal, in descending order of match quality.
Fig. 7: Pre-processing
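The ranking of the five best-matching proposals can be illustrated with a minimal Python sketch. It assumes that every proposal is already represented as a numeric feature vector and that similarity is measured by cosine similarity; the similarity measure is not specified here, so both this choice and the names `cosine_similarity` and `top_matches` are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def top_matches(query, cluster, n=5):
    """Return up to n (proposal_id, score) pairs from `cluster`, best match first.

    cluster: dict mapping a proposal id to its feature vector (hypothetical layout).
    """
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in cluster.items()]
    scored.sort(key=lambda item: item[1], reverse=True)  # descending similarity
    return scored[:n]
```

A reviewer-facing tool could call `top_matches(input_vector, cluster_vectors)` on the cluster of the matched discipline and display the returned list as it is.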
After the pre-processing phase, the following three phases are carried out:
Phase 1: Constructing Research Ontology-
The research topics of the discipline $A_k$ ($k = 1, 2, \ldots, K$) are created first. The keywords and their frequencies are denoted by the feature set
$(No_k, ID_k, year, \{(keyword_1, frequency_1), (keyword_2, frequency_2), \ldots, (keyword_k, frequency_k)\})$,
where
$No_k$ is the sequence number of the $k$th record and $ID_k$ is the corresponding discipline code.
The keyword frequency in the feature set is the sum of the occurrences of the same keyword in this discipline. The feature set of $A_k$, as shown in Fig. 8, is then denoted by $(No_k, ID_k, \{(keyword_1, frequency_1), (keyword_2, frequency_2), \ldots, (keyword_k, frequency_k)\})$.
Fig. 8: Feature set of Ak .
Fig. 9: Structure of the research ontology.
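The aggregation of keyword frequencies per discipline in Phase 1 can be sketched in Python as follows. This is only an illustration: the record layout `(discipline_id, year, {keyword: frequency})`, the class `DisciplineFeatureSet`, and the function `build_feature_sets` are hypothetical names, and the sequence number is simply assigned in order of first appearance.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DisciplineFeatureSet:
    no: int                # sequence number of the record (No_k)
    discipline_id: str     # discipline code (ID_k)
    keywords: Counter = field(default_factory=Counter)  # keyword -> summed frequency


def build_feature_sets(records):
    """Sum keyword frequencies over all records of the same discipline.

    records: iterable of (discipline_id, year, {keyword: frequency}) tuples.
    Returns a dict mapping discipline_id to its DisciplineFeatureSet.
    """
    feature_sets = {}
    for discipline_id, _year, keyword_freqs in records:
        if discipline_id not in feature_sets:
            feature_sets[discipline_id] = DisciplineFeatureSet(
                no=len(feature_sets) + 1, discipline_id=discipline_id
            )
        # Counter.update adds the given counts to the existing ones.
        feature_sets[discipline_id].keywords.update(keyword_freqs)
    return feature_sets
```

Dropping the year while summing mirrors the step from the per-record feature set to the feature set of the discipline.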
Phase 2: Classifying New Research proposals Into Disciplines-
Proposals are classified by the discipline areas to which they belong. A simple sorting algorithm is used for the classification of proposals, with reference to the research ontology, as follows.
Suppose that there are $K$ discipline areas, and $A_k$ denotes area $k$ ($k = 1, 2, \ldots, K$). $P_i$ denotes proposal $i$ ($i = 1, 2, \ldots, I$), and $S_k$ represents the set of proposals that belong to area $k$. A sorting algorithm can then be implemented to classify the proposals into their discipline areas: each proposal $P_i$ that belongs to $A_k$ is placed in $S_k$.
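A minimal Python sketch of such a sorting step is given below. The membership test is an assumption: a proposal is taken to belong to the discipline whose ontology keywords overlap most with the proposal's own keywords, and a proposal with no overlap is left unassigned. The function name `classify_proposals` and the dictionary layouts are hypothetical.

```python
def classify_proposals(proposals, ontology):
    """Place each proposal P_i into the set S_k of its best-matching area A_k.

    proposals: dict mapping proposal id -> set of keywords
    ontology:  dict mapping discipline code -> set of keywords
    Returns a dict mapping discipline code -> list of proposal ids (S_k).
    """
    sets = {code: [] for code in ontology}
    for pid, keywords in proposals.items():
        best_code, best_overlap = None, 0
        for code, area_keywords in ontology.items():
            overlap = len(keywords & area_keywords)  # shared keywords
            if overlap > best_overlap:
                best_code, best_overlap = code, overlap
        if best_code is not None:  # skip proposals that match no area
            sets[best_code].append(pid)
    return sets
```

Ties are resolved in favour of the discipline listed first in `ontology`; a production system would need an explicit rule for ties and for unassigned proposals.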
Phase 3: Clustering Research Proposals based on Similarities using Text Mining-
After the research proposals are classified by discipline area, the proposals in each discipline are clustered using the text-mining technique, as shown in Fig. 10. The main clustering process consists of five steps:
Fig. 10: Main process of text mining.
Step 1) Text document collection:
After the research proposals are classified according to the discipline areas, the proposal documents in each discipline $A_k$ ($k = 1, 2, \ldots, K$) are collected for text document preprocessing.
Step 2) Text document preprocessing:
The contents of proposals are usually unstructured. The research ontology is therefore used to analyze, extract, and identify the keywords in the full text of the proposals. Finally, the vocabulary size is reduced further by removing all words that appear only a few times across all proposal documents.
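The preprocessing step can be sketched in Python as below. The sketch assumes single-word ontology keywords, a simple lowercase tokenizer, and a hypothetical threshold `min_count` for what counts as "only a few times"; none of these details are fixed by the method description.

```python
import re
from collections import Counter


def extract_keywords(text, ontology_keywords):
    """Tokenize the text and keep only tokens that are ontology keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in ontology_keywords]


def prune_rare(docs_tokens, min_count=2):
    """Drop tokens whose total count over all documents is below min_count."""
    totals = Counter(t for tokens in docs_tokens for t in tokens)
    return [[t for t in tokens if totals[t] >= min_count] for tokens in docs_tokens]
```

The output of `prune_rare` is one token list per proposal, which is the form expected by the encoding step.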
Step 3) Text document encoding:
After the text documents are segmented, they are converted into a feature vector representation $V = (v_1, v_2, \ldots, v_M)$, where $M$ is the number of features selected and $v_i$ ($i = 1, 2, \ldots, M$) is the TF-IDF encoding [3] of the keyword $w_i$.
TF-IDF encoding is a weighting method that combines the term frequency (TF) with the inverse document frequency (IDF) to produce the feature $v_i$, such that
$v_i = tf_i \times \log(N / df_i)$,
where
$N$ is the total number of proposals in the discipline,
$tf_i$ is the term frequency of the feature word $w_i$, and $df_i$ is the number of proposals containing the word $w_i$.
Thus, research proposals can be represented by corresponding feature vectors.
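The encoding $v_i = tf_i \times \log(N / df_i)$ can be computed with the short Python sketch below. It assumes that each proposal has already been reduced to a list of keyword tokens, that $tf_i$ is the raw count of the keyword in the proposal, and that the natural logarithm is used; the base of the logarithm is not stated above, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(docs_tokens):
    """Encode each tokenized proposal as a TF-IDF feature vector.

    docs_tokens: list of token lists, one per proposal in the discipline.
    Returns (vocab, vectors), where vectors[j][i] = tf_i * log(N / df_i).
    """
    n_docs = len(docs_tokens)                       # N
    vocab = sorted({t for tokens in docs_tokens for t in tokens})
    # df counts in how many proposals each word occurs (once per proposal).
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)                        # raw term frequency
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors
```

A keyword that occurs in every proposal receives weight zero, since $\log(N / df_i) = \log 1 = 0$, which reflects that such a keyword does not help to distinguish proposals.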
Step 4) Vector dimension reduction:
The dimension of the feature vectors is often too large; thus, it is necessary to reduce the vectors' size by automatically selecting a subset containing the most important keywords in terms of frequency. Latent semantic indexing (LSI) is used to solve this problem [2]. It not only reduces the dimensions of the feature vectors effectively but also captures the semantic relations among the keywords. LSI is a technique for substituting the original data vectors with shorter vectors in which the semantic information is preserved. To reduce the dimensions of the document vectors without losing useful information in a proposal, a term-by-document matrix is formed, in which each column holds the term frequencies of one document. The term-by-document matrix is then decomposed into a set of eigenvectors using singular-value decomposition. The eigenvectors that have the least impact on the matrix are discarded. Thus, the document vector formed from the remaining eigenvectors has a very small dimension and retains almost all of the relevant original features.
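The reduction can be sketched with NumPy's singular-value decomposition, as below. This is a minimal illustration under stated assumptions: the input is a dense terms-by-documents matrix, the number of retained dimensions `rank` is chosen by the caller, and the function name `lsi_reduce` is hypothetical. Strictly speaking, the decomposition yields singular vectors, which play the role of the eigenvectors mentioned above.

```python
import numpy as np


def lsi_reduce(term_doc, rank):
    """Project documents onto the `rank` strongest latent dimensions.

    term_doc: array of shape (n_terms, n_docs), one column per document.
    Returns an array of shape (n_docs, rank) with one reduced vector per document.
    """
    # Thin SVD: term_doc = u @ diag(s) @ vt, singular values in descending order.
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the largest singular values and discard the rest.
    reduced = np.diag(s[:rank]) @ vt[:rank, :]
    return reduced.T
```

When `rank` equals the full rank of the matrix, the inner products between the reduced document vectors equal those between the original columns; choosing a smaller `rank` trades some of that fidelity for shorter vectors.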
Step 5) Text vector clustering:
This step uses an SOM algorithm to cluster the feature vectors based on the similarities of their research areas. The SOM algorithm [1] is a typical unsupervised-learning neural network model that groups similar input data together.
Fig. 11: Structure of SOM
Input : Feature Vectors.
Processing : Forms Clusters.
Output : Provides Best Matching Unit(BMU) in terms of X-Y co-ordinates.
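The input, processing, and output listed above can be illustrated with a small pure-Python SOM sketch. All parameter choices here are assumptions made for illustration: a rectangular grid, random initial weights in [0, 1], a linearly decaying learning rate and neighbourhood radius, and a Gaussian neighbourhood function. The names `best_matching_unit` and `train_som` are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) grid coordinates of the node whose weights are closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))  # squared Euclidean
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM on the feature vectors in `data`."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)      # decaying neighbourhood radius
        for x in data:
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
                    w = weights[r][c]
                    for i in range(dim):
                        # Pull the node towards x, more strongly near the BMU.
                        w[i] += lr * h * (x[i] - w[i])
    return weights
```

After training, calling `best_matching_unit(weights, vector)` for a new proposal's feature vector gives its grid coordinates, and proposals that map to the same or neighbouring nodes can be treated as one cluster.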
CONCLUSION
Today, competition requires timely and sophisticated analysis of an integrated view of data. A new technology leap is needed to structure and prioritize information for specific end-user problems.
The Plagiarism Detection Method for Clustering Proposals can make this leap. A research ontology is constructed to categorize the concept terms in different discipline areas and to form relationships among them. It enables the text-mining technique to cluster research proposals based on their similarities. This method can be used in government research funding agencies that face information overload problems. It can be used in colleges and universities to find ambiguity in the SRS submitted by students. It can also be used for patent analysis, in support of intellectual property rights.