Plagiarism is the act of copying others' work without properly citing the original author. Raid on Code Pirate, abbreviated ROCOP, is a plagiarism detection system. The application detects plagiarism in plain text and in source code. The system is divided into two modules: a plagiarism detection module and a web crawler module. The web crawler mines web pages from the Internet and builds a web base containing a large number of pages. The plagiarism detection module measures the extent of plagiarism in a submitted document by comparing it against the pages in the web base, with a similarity value above a threshold indicating possible plagiarism.
Introduction
ROCOP has been designed to assist lecturers in departments, with the ultimate goal of discouraging students from copying source code in their programming assignments and projects. With the rapid growth in Internet usage, digital documents are increasingly prone to plagiarism. The system developed so far finds plagiarism in a given text document within the accuracy level guaranteed by the winnowing fingerprinting algorithm.
We have deviated slightly from our original project description. The crawling is limited to a single domain owing to limited resources. In addition, the entire system is now being developed as a web-based system, which allows many end users to access it simultaneously from their browsers. The back-end processing, i.e., creating the fingerprint of the uploaded document and comparing it with the database, is handled by spawning a thread for each user.
Background
A significant amount of work has already been carried out in the field of plagiarism detection. Based on the characteristic properties the systems use in their comparisons, plagiarism detection systems are roughly grouped into two categories: attribute counting systems and structure metric systems. Systems built using attribute counting techniques can easily be fooled by changing the input slightly, because they only count the number of unique operators, unique operands, and their occurrences.
Based on these two techniques, there are already many plagiarism detection systems that help people find similarities in sets of code. A brief description of a few widely used systems is presented below.
JPlag
JPlag is a plagiarism detection tool aimed at detecting similarities among source code files. It was created by Guido Malpohl in 1996 and currently supports Java, C#, C, C++, Scheme, and natural language text. JPlag is free, but users are required to create an account. It uses a variation of the Karp-Rabin comparison algorithm developed by Wise, with additional optimizations to improve its runtime efficiency.
JPlag operates as a structure metrics based system in which each file is tokenised. In the comparison stage, code is compared in pairs: substrings of one code set are used to try to cover the other code set, and the percentage of the string that is covered gives the similarity between the two.
MOSS
MOSS (Measure of Software Similarity) is an automatic system for determining the similarity of programs. It was created by Alex Aiken at UC Berkeley.
Like JPlag, MOSS is a structure metrics based plagiarism detection system. It implements a local document fingerprinting algorithm called winnowing. When source code is supplied to the web-based interface, the result is returned as a web page in which MOSS highlights the specific lines of code that appear suspicious. MOSS can run on Unix or Windows systems and can detect similarity in a wide variety of languages.
Turnitin
Turnitin is a web-based plagiarism detection system created by iParadigms, LLC. The same company also offers a plagiarism detection service for newspaper editors and for book and magazine publishers, called iThenticate, while the services marketed under the Turnitin brand are aimed at the education market. Turnitin checks for possible plagiarism by comparing submitted papers to several databases using a proprietary algorithm. It scans its own databases and also has licensing agreements with large academic proprietary databases.
Viper
Viper is a plagiarism detection system used to check for both accidental and intentional plagiarism, incorrect citations, and missing quotation marks. It is free to use and checks pieces of work against billions of sources, including past essays, Internet resources, and journals.
DOC Cop
DOC Cop is a plagiarism, cryptomnesia, and collusion detection tool that creates reports displaying the correlation and matches between documents, or between a document and the Web. It is an Australian service with fast turnaround, capable of comparing multiple documents at a time.
Literature Review
Different kinds of plagiarism detection systems are available online. Most of them are either closed source or provided as a paid service, and some are restricted to a particular document repository, such as a journal archive. There are hardly any services that check a document against a large repository while being open source and free of charge.
An attempt to develop a free/open source system with a large-scale repository has resulted in the system termed ROCOP. The large repository, also called the "web base", can be used for purposes other than plagiarism detection. Most importantly, here at the Institute of Engineering, Pulchowk Campus, there is no such system that can be customised to requirements such as checking documents against the college's archive. This project was proposed and started as a possible solution to the problem stated above.
As already mentioned, our system is divided into two modules, which are discussed below.
A. Web Crawler Module
A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks.
Web Base:
The web base is a Web repository. The system collects a sizable portion of the Web and stores a local copy. The web crawler module is responsible for downloading pages from the Web and storing them in the repository. The figure below describes the crawler's conceptual operation.
[Figure: Block diagram of the crawler, showing the Init, Take a URL, Download Page, and Extract URLs steps together with the Domain, Repository, Visited URLs, and URLs to Visit components.]
As shown in the block diagram, the crawler maintains two data structures to manage page URLs: URLsToVisit and VisitedURLs, both kept as queues. VisitedURLs holds the list of pages already downloaded; this list is necessary for the crawler to avoid downloading a page multiple times in a single crawl. The URLsToVisit queue contains the list of pages to download in the future.
When the crawler is initiated, a URL is either specified manually or taken from some source. Suppose the specified URL is http://x.com/y. The web page corresponding to that URL is downloaded from that domain, passed to the repository, and stored in compressed form. The URL that has just been visited is then appended to the VisitedURLs queue so that the same page is not downloaded multiple times. After that, the crawler extracts all hyperlinks from the downloaded page and converts relative links into absolute URLs. Suppose one of the new URLs is http://x.com/a. The crawler checks whether it has already been downloaded; if so, it is discarded, and otherwise the crawler adds it to the URLsToVisit queue. If a new URL is, say, http://z.com, it is also discarded because it belongs to a different domain from the one originally specified; to download pages under http://z.com, the crawler must be started again with that URL as the seed. This process is repeated until the crawler has exhausted the URLsToVisit queue or is stopped by an operator.
The initial contents of the URLsToVisit queue are called the seed list, and this list is expanded over time. Before the crawler runs for the first time, the seed URLs are specified manually or obtained from some source. The initial list can be populated with a diverse set of websites, or the crawler can be biased towards particular interests using a more focused selection of URLs.
If the fetch of a URL fails, the crawler can return the URL to the URLsToVisit queue for another attempt. The crawler must give up if it is unable to fetch a URL repeatedly, because seemingly transient errors, such as a Web server not responding, can in truth be symptoms of fatal errors (e.g., a Web server is permanently disabled).
Thus, a crawler is able to fetch an arbitrarily large portion of the Web from a small list of seed URLs, using the pages it fetches to discover additional, not-yet-crawled pages.
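The following Python sketch illustrates the crawl loop just described. It is a minimal illustration, not the actual ROCOP crawler: the function and variable names, the compressed on-disk layout, and the max_pages limit are assumptions made for the example.

import gzip
import os
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, repository_dir, max_pages=100):
    """Download pages reachable from seed_url, staying inside its domain."""
    domain = urlparse(seed_url).netloc
    urls_to_visit = deque([seed_url])   # pages still to download
    visited_urls = set()                # pages already downloaded
    os.makedirs(repository_dir, exist_ok=True)

    while urls_to_visit and len(visited_urls) < max_pages:
        url = urls_to_visit.popleft()
        if url in visited_urls:
            continue
        try:
            html = urlopen(url, timeout=10).read()
        except OSError:
            continue                    # a real crawler could retry a few times
        visited_urls.add(url)

        # Store the page in the repository in compressed form.
        path = os.path.join(repository_dir, "%d.gz" % len(visited_urls))
        with gzip.open(path, "wb") as f:
            f.write(html)

        # Extract hyperlinks, convert them to absolute URLs, and keep
        # only those belonging to the same domain as the seed.
        extractor = LinkExtractor()
        extractor.feed(html.decode("utf-8", errors="ignore"))
        for link in extractor.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain and absolute not in visited_urls:
                urls_to_visit.append(absolute)

    return visited_urls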
B. Plagiarism Detection Module
Our plagiarism detection module is built using the more advanced and practically effective technique, the structure metric approach, which extracts and compares representations of program structure and thereby gives an improved measure of similarity. The module receives an input document, processes it, and returns the matching pages from the web base.
The general approach is to remove all irrelevant features, such as white space and punctuation marks, from the given document. The whole document is then divided into k-grams, where a k-gram is a contiguous substring of length k; there are nearly as many k-grams as there are characters in the document. Each k-gram is hashed, and some subset of those hashes is selected as the document's fingerprints. A fingerprint also carries positional information identifying the document and the location within it from which the fingerprint was generated. If the hash function is chosen so that the probability of hash collisions is very small, then whenever two documents share one or more fingerprints it is extremely likely that they also share a k-gram.
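As a concrete illustration of this step, the short Python sketch below strips white space and punctuation and enumerates the k-grams together with their positions. The cleaning rules shown (drop white space and punctuation, lowercase everything) are an assumption made for the example rather than the exact normalisation used in ROCOP.

import string

def clean(text):
    """Drop white space and punctuation and lowercase the rest."""
    drop = set(string.whitespace + string.punctuation)
    return "".join(c.lower() for c in text if c not in drop)

def kgrams(text, k):
    """Yield (position, k-gram) for every contiguous substring of length k."""
    for i in range(len(text) - k + 1):
        yield i, text[i:i + k]

# "A do run run run, a do run run" cleans to "adorunrunrunadorunrun",
# which yields 21 - 5 + 1 = 17 overlapping 5-grams.
for position, gram in kgrams(clean("A do run run run, a do run run"), 5):
    print(position, gram)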
The hashing technique and the hash subset selection technique used here are slightly different, and are described below.
Hashing
We have used the Karp-Rabin rolling hash function to calculate the hashes of the k-grams.
Karp-Rabin Rolling hash function:
A rolling hash function treats every substring as a number in some base, the base usually being a large prime. For example, if the substring is "hi" and the base is 101, the hash value would be 104 × 101^1 + 105 × 101^0 = 10609 (the ASCII code of 'h' is 104 and of 'i' is 105).
This technique calculates the hash of the (i+1)th k-gram easily from the hash of the ith k-gram. Treat a k-gram C1C2…Ck as a k-digit number in some base b, usually a large prime:
Hash(C1 … Ck) = C1*b^(k-1) + C2*b^(k-2) + … + C(k-1)*b + Ck
Then, Hash(C2 … Ck+1) = ( Hash(C1 … Ck) − C1*b^(k-1) ) * b + Ck+1
The essential benefit achieved by such representation is that it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independent of the substrings' lengths.
For example, if we have the text "abracadabra" and we are searching for a pattern of length 3, we can compute the hash of "bra" from the hash of "abr" (the previous substring) by subtracting the number contributed by the first 'a' of "abr", i.e. 97 × 101^2 (97 is the ASCII code for 'a' and 101 is the base we are using), multiplying by the base, and adding the number for the last 'a' of "bra", i.e. 97 × 101^0 = 97. If the substrings in question are long, this algorithm achieves great savings compared with many other hashing schemes.
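The following Python sketch shows the rolling computation described above, using the same base 101 as in the example. It is a simplified illustration rather than the exact ROCOP code; a production version would normally reduce each hash modulo a large prime to keep the values bounded.

def rolling_hashes(text, k, base=101):
    """Return the hash of every k-gram of text, computed incrementally."""
    if len(text) < k:
        return []
    # Hash of the first k-gram, computed directly.
    h = 0
    for c in text[:k]:
        h = h * base + ord(c)
    hashes = [h]
    high = base ** (k - 1)              # weight of the leading character
    # Each subsequent hash follows from the previous one in constant time:
    # remove the leading character, shift by the base, add the new character.
    for i in range(1, len(text) - k + 1):
        h = (h - ord(text[i - 1]) * high) * base + ord(text[i + k - 1])
        hashes.append(h)
    return hashes

# The hash of "bra" is derived from the hash of "abr" in O(1):
assert rolling_hashes("abracadabra", 3)[1] == rolling_hashes("bra", 3)[0]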
Hash Subset Selection
An algorithm developed at UC Berkeley called winnowing is implemented for selecting the subset of hashes; it greatly reduces the number of hashes in the subset while retaining correctness and improving comparison time.
Winnowing:
Consider a sequence of hashes h1 … hn that represents a document. Let the window size be w = t − k + 1, where t is the guarantee threshold, i.e. the least length of a detectable match, and k is the noise threshold, below which matches are undetectable. Each position 1 ≤ i ≤ n − w + 1 in the sequence defines a window of hashes hi … hi+w−1. In each window, select the minimum hash value; if there is more than one hash with the minimum value, select the rightmost occurrence. All the selected hashes are saved as the fingerprints of the document.
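A minimal Python sketch of this selection rule is shown below, reusing the clean() and rolling_hashes() helpers sketched earlier. It follows the definition above directly and is not the optimised implementation used in practice.

def winnow(hashes, t, k):
    """Select fingerprints (hash, position) using window size w = t - k + 1."""
    w = t - k + 1
    fingerprints = []
    previous = None
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        minimum = min(window)
        # Rightmost occurrence of the minimum value within the window.
        j = max(idx for idx, h in enumerate(window) if h == minimum)
        selected = (minimum, i + j)     # keep the position for later reporting
        if selected != previous:        # the same occurrence may win many windows
            fingerprints.append(selected)
            previous = selected
    return fingerprints

# Example: with noise threshold k = 5 and guarantee threshold t = 8,
# any match of length 8 or more shares at least one fingerprint.
document = clean("A do run run run, a do run run")
print(winnow(rolling_hashes(document, 5), t=8, k=5))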
System Architecture
The system has a client-server architecture. The administrator of the system feeds the seed URLs into the URL dispenser list. The web pages at the provided URLs are then crawled and put into the web base. Fingerprints of the pages are generated and kept in the fingerprint database. End users must first authenticate before they can submit documents to check for plagiarism; in addition to the documents, they must also specify the detection type. The fingerprints of the input documents are then generated, with end-user traffic managed by a queuing system. Finally, the fingerprints of the input documents are matched against the fingerprints of the web pages in the fingerprint database, and the matching web pages are retrieved for display.
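To make the final matching step concrete, the sketch below scores an input document against stored pages by fingerprint overlap. The data layout (an in-memory dictionary from page identifiers to fingerprint lists) and the example threshold of 0.2 are assumptions made for illustration; the real system keeps the fingerprints in a database.

def match_score(input_fps, page_fps):
    """Fraction of the input document's fingerprint hashes found in a stored page."""
    input_hashes = {h for h, _ in input_fps}
    page_hashes = {h for h, _ in page_fps}
    if not input_hashes:
        return 0.0
    return len(input_hashes & page_hashes) / len(input_hashes)

def rank_matches(input_fps, fingerprint_db, threshold=0.2):
    """Return (page_id, score) pairs at or above the threshold, best match first."""
    scores = ((page_id, match_score(input_fps, fps))
              for page_id, fps in fingerprint_db.items())
    return sorted((s for s in scores if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)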
Experimental Setup
System Implementation Description
Implemented System Components
Fingerprint Module
To date, we have developed the fingerprint module, which generates the fingerprints of documents. While implementing it, we observed a few problems in the winnowing method that resulted in substantial memory usage, CPU usage, and time delay. We found the defect in the winnowing code and changed the implementation to reduce the computational complexity, after which we observed a drastic reduction in memory usage and time delay. However, the high CPU usage remains, and we will research this issue further in the coming days.
Crawler Module
The crawler module developed so far is fully functional. It extracts the pages from the given domain and stores them on the local disk. A few issues remain with this module: converting the saved files from .dat to .html and then extracting the text from the HTML files is future work.
Front-end Module
The web front end has already been developed. It provides options to log in or register in the system, and both login form validation and registration form validation have been implemented. File uploading and the option to check for plagiarism in plain text or in source code have also been completed. The remaining work is to display the matched pages in the browser.
Components to be developed and remaining work
Database Interfacing
Thread Handling
Database Sharding
System Integration
Testing
Conclusion and Future Work
The final project submission period has been extended beyond the date we proposed in our initial project proposal, and we have not been able to keep to the schedule provided initially. However, considering the project submission deadline, we will be working hard as a team to complete the remaining work.