A Survey of Plagiarism Detection Methods


INTRODUCTION

With the huge growth of information on the WWW and in digital libraries, plagiarism has become one of the most important issues for universities, schools, and research communities. Advanced search engines make it easy for students to find documents or journal articles on the internet, and some researchers simply copy and paste the work of others without crediting the original authors.

DEFINING PLAGIARISM

Plagiarism can be defined as "the act of taking another person's writing, conversation, song, or even idea and passing it off as your own. This includes information from web pages, books, songs, television shows, email messages, interviews, articles, artworks or any other medium."

According to the Merriam-Webster Online Dictionary, to "plagiarize" means:

To steal and pass off (the ideas or words of another) as one's own.

To use (another's production) without crediting the source.

To commit literary theft.

To present as new and original an idea or product derived from an existing source.

Also, according to Turnitin.com and Research Resources, the following are considered plagiarism:

Turning in someone else's work as your own.

Copying words or ideas from someone else without giving credit.

Failing to put a quotation in quotation marks.

Giving incorrect information about the source of a quotation.

Changing words but copying the sentence structure of a source without giving credit.

Copying so many words or ideas from a source that it makes up the majority of your work, whether you give credit or not (see our section on "fair use" rules).

Plagiarism can be classified into five categories:

Copy & Paste Plagiarism.

Word Switch Plagiarism.

Style Plagiarism.

Metaphor Plagiarism.

Idea Plagiarism.

WHY PLAGIARISM DETECTION IS IMPORTANT

Plagiarism detection is important in schools and universities because many students cheat in their homework, reports, or projects. All academic institutions should therefore use plagiarism detection software: students are far less likely to copy or minimally modify documents when they know they will be caught.

Note that anti-anti-plagiarism tools are already available to help students who want to cheat: a student can submit a paper and receive a changed version in return (typically with many words replaced by synonyms), and the changed version fools most plagiarism detection tools.

Plagiarism is not committed only by students; some staff members also publish papers that are directly copied or partially modified from other sources in order to gain fame.

To date, no plagiarism detection tool can prove that a document has been copied from other sources; it can only give a hint that a paper contains textual segments that are also available in other papers. The first author of this paper submitted one of his papers, which had been published in a reputable journal, to a plagiarism detection tool. The tool reported 71% plagiarism! The explanation was that parts of the paper had been copied by two universities using OCR software on their servers. This shows two things: first, plagiarism detection tools can also be used to find out whether others have copied illegally from one's own documents, and second, they can help reveal copyright violations, as they did in this case: the journal had given no permission to copy the paper! Plagiarism detection tools may thus be used for purposes somewhat different from those intended, such as the discovery of copyright violations.

Some journals and conferences now routinely run a check on submitted papers: they are not so much worried about plagiarism as such, but about (i) too much self-plagiarism (who wants to publish a paper in a good journal that has already appeared elsewhere with minor modifications?) and (ii) copyright violation. Observe in passing that the copyright statements usually required when submitting papers to prestigious journals ask that the submitter be entitled to submit the paper (i.e., has copyright clearance), but they usually do not ask that the paper was actually authored by the person submitting it. This subtle difference means that someone who wants to publish a good paper may actually turn to a paper mill and order one, including the transfer of copyrights!

PLAGIARISM DETECTION METHODS

A large number of plagiarism detection methods have been developed by researchers in past years. Here we describe the latest, most important, and most effective tools and techniques for automatic plagiarism detection:

Firstly, some researchers use natural language text copy detection. This technique appeared in the 1990s and has produced three detection approaches [2]:

4.1. Grammar-based method.

This method focuses on the grammatical structure of documents and uses a string-based matching approach to measure similarity between documents. Its limitation is that it yields much better results when detecting verbatim copying than when detecting copied text that involves synonym replacement or rewriting.

Huang [3] proposed a method for detecting similar web pages based on the LCS (Longest Common Subsequence) algorithm: the longest common subsequence of two pages is found and used to calculate the similarity between them.
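As an illustration, here is a minimal Python sketch of LCS-based similarity. The whitespace tokenisation and the normalisation by the shorter document's length are our assumptions, not details taken from [3].

```python
# A minimal sketch of LCS-based similarity between two texts.
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(doc1, doc2):
    """Normalise the LCS length by the shorter document's length."""
    a, b = doc1.split(), doc2.split()
    return lcs_length(a, b) / min(len(a), len(b))

print(lcs_similarity("the quick brown fox", "the slow brown fox"))  # 0.75
```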

The Winnowing algorithm [4] uses an overlapping k-gram method to compute hashes of a document, selects the minimum hash value from each sliding window to obtain the document's fingerprints, and then calculates the proportion of matching fingerprints to obtain the similarity between two documents.
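The following is a minimal sketch of winnowing-style fingerprint selection. Python's built-in hash() stands in for the rolling hash of [4], and we simply take the minimum per window rather than the paper's exact tie-breaking rule.

```python
# A sketch of winnowing fingerprint selection: hash overlapping k-grams,
# then keep the minimum hash from each sliding window.
def winnow(text, k=5, window=4):
    """Return the set of fingerprints selected by winnowing."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    hashes = [hash(g) & 0xFFFFFFFF for g in grams]
    fingerprints = set()
    for i in range(len(hashes) - window + 1):
        fingerprints.add(min(hashes[i:i + window]))  # one minimum per window
    return fingerprints

def winnow_similarity(doc1, doc2):
    """Proportion of doc1's fingerprints that also occur in doc2."""
    f1, f2 = winnow(doc1), winnow(doc2)
    return len(f1 & f2) / len(f1) if f1 else 0.0
```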

Hash-breaking [5] and DCT [6] are also grammar-based methods; the only difference between them is how the fingerprints of a document are obtained.
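To illustrate how such fingerprinting strategies differ, below is a hedged sketch of breakpoint-style chunking in the spirit of hash-breaking [5]: the token stream is cut wherever a word's hash is divisible by a parameter p, and each resulting chunk is hashed into a fingerprint. The parameter value and the use of Python's built-in hash() are our assumptions.

```python
# A sketch of hash-breaking-style fingerprinting: break the token stream at
# words whose hash is divisible by p, then hash each chunk (details assumed).
def hash_break_fingerprints(text, p=4):
    fingerprints, chunk = set(), []
    for word in text.split():
        chunk.append(word)
        if hash(word) % p == 0:          # breakpoint: close the current chunk
            fingerprints.add(hash(" ".join(chunk)))
            chunk = []
    if chunk:                            # hash any trailing chunk
        fingerprints.add(hash(" ".join(chunk)))
    return fingerprints
```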

4.2. Semantics-based method.

This method uses the vector space model from information retrieval: word-frequency statistics over a document yield a feature vector for that document, and the feature vectors of two documents are then compared with measures such as the dot product or the cosine. The feature vector is the key to the document similarity. This method also has a limitation: it is not always effective at detecting partial plagiarism, because it is difficult to determine the location of the copied text.
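A minimal sketch of this vector-space comparison, assuming simple whitespace tokenisation and raw term frequencies:

```python
# Cosine similarity over term-frequency vectors.
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    v1, v2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0
```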

4.3. Grammar-semantics hybrid method [1].

This method is intended to solve the problems of the two methods mentioned above and to improve detection results. Locating is an important step in text copy detection technology: in addition to calculating document similarity, the method must report the position of the plagiarized content within the document.

Secondly, some approaches use task-specific index structures:

Malcolm and Lane [12] used the desktop plagiarism detection system Ferret, which is based on common word tri-grams.
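A hedged sketch of word tri-gram comparison in this spirit; the Jaccard-style normalisation is our assumption rather than Ferret's exact resemblance measure.

```python
# Word tri-gram overlap between two documents.
def trigrams(text):
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def resemblance(doc1, doc2):
    t1, t2 = trigrams(doc1), trigrams(doc2)
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0
```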

Basile et al. [13] encoded texts as word-length sequences and used a downstream vector-based n-gram distance measure for candidate selection.
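A small sketch of the word-length idea; the length cap and the n-gram size are assumptions rather than details taken from [13].

```python
# Encode a text as a sequence of word lengths and extract n-grams of it.
def length_ngrams(text, n=8):
    lengths = [min(len(w), 9) for w in text.split()]  # cap lengths at 9 (assumed)
    return {tuple(lengths[i:i + n]) for i in range(len(lengths) - n + 1)}

def shares_length_ngram(susp, src, n=8):
    """Candidate check: do two texts share any word-length n-gram?"""
    return bool(length_ngrams(susp, n) & length_ngrams(src, n))
```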

Kasprzak et al. [14] incorporated common text shingles in the pre-selection process, and Shcherbinin and Butakov employed hash-based fingerprints for candidate retrieval.

Grozea et al. [15] used string kernels to compute a complete similarity matrix between all source and suspicious documents.

Thirdly, the external plagiarism detection method:

External plagiarism detection relies on a reference corpus composed of documents from which passages might have been plagiarized. A passage could be a paragraph, a fixed-size block of words, a block of sentences, and so on. A suspicious document is checked for plagiarism by searching for passages that are duplicates or near-duplicates of passages in documents within the reference corpus. An external plagiarism system then reports these findings to a human controller, who decides whether the detected passages are plagiarized or not. A naive solution to this problem is to compare each passage in a suspicious document to every passage of each document in the reference corpus. This is obviously prohibitive, since the reference corpus has to be large in order to find as many plagiarized passages as possible [7].

This fact directly translates into very high runtimes for the naive approach. External plagiarism detection is similar to textual information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999): given a set of query terms, an IR system returns a ranked set of documents from a corpus that best match the query terms. The most common structure for answering such queries is an inverted index. An external plagiarism detection system using an inverted index indexes the passages of the reference corpus' documents.

For each passage in a suspicious document, a query is sent to the system, and the returned ranked list of reference passages is analyzed.
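A minimal sketch of this passage-level retrieval, with a simple shared-term count standing in for a full IR ranking function:

```python
# Passage-level candidate retrieval with an inverted index.
from collections import Counter, defaultdict

def build_index(passages):
    """Map each term to the ids of the reference passages containing it."""
    index = defaultdict(set)
    for pid, passage in enumerate(passages):
        for term in set(passage.lower().split()):
            index[term].add(pid)
    return index

def query(index, suspicious_passage, top_k=5):
    """Rank reference passages by the number of terms shared with the query."""
    scores = Counter()
    for term in set(suspicious_passage.lower().split()):
        for pid in index.get(term, ()):
            scores[pid] += 1
    return scores.most_common(top_k)
```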

Such a system was presented in (Hoad and Zobel, 2003) for finding duplicate or near duplicate documents.

Another method for finding duplicates and near duplicates is based on hashing or fingerprinting. Such methods produce one or more fingerprints that describe the content of a document or passage. A suspicious document's passages are compared to the reference corpus based on their hashes or fingerprints. Duplicate and near duplicate passages are assumed to have similar fingerprints.

One of the first systems for plagiarism detection using this schema was presented in (Brin, Davis, and Garcia-Molina, 1995).

External plagiarism detection can also be viewed as a nearest neighbor problem in a vector space R^d.
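A small sketch of this view, assuming passages have already been embedded as rows of a NumPy matrix:

```python
# Brute-force nearest-neighbour search over passage vectors in R^d.
import numpy as np

def nearest_neighbours(query_vec, reference_matrix, k=5):
    """Indices of the k reference vectors closest to the query (Euclidean)."""
    dists = np.linalg.norm(reference_matrix - query_vec, axis=1)
    return np.argsort(dists)[:k]
```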

Examples of research on external plagiarism detection are:

Cristian Grozea and Marius Popescu [11], in "Automatic Detection of the Direction of Plagiarism", used an extension of the Encoplot method to determine the direction of plagiarism. Testing on the largest plagiarism corpus available to date, consisting of artificial plagiarism, they showed that the direction problem can be solved with fairly high accuracy (about 75%); however, the method was not tested on natural plagiarism.

Clara Vania and Mirna Adriani [16], in "Automatic External Plagiarism Detection Using Passage Similarities", detect external plagiarism by first pre-processing: non-English documents are identified and translated into English. The documents are then indexed, and the top documents most similar to the suspicious one are retrieved. The retrieved documents are divided into passages of twenty sentences each, and plagiarism is detected by counting the number of overlapping words between suspicious and source passages.

Sobha Lalitha Devi, Pattabhi R K Rao, Vijay Sundar Ram, and A Akilandeswari [17] developed an algorithm for detecting external plagiarism in the PAN-10 competition. The algorithm has two steps: 1. identification of the documents similar to a suspicious document, and of its plagiarized sections, with respect to the source documents, using the Vector Space Model (VSM) and the cosine similarity measure; and 2. identification of the plagiarized area in the suspicious document using a chunk ratio. However, no pre-processing of the documents is performed.

Gupta Parth, Rao Sameer, and Prasenjit Majumdar [18] used an n-gram approach for external plagiarism detection, developing a system in which plagiarized chunks must be found within a given large collection of source documents; however, their system needs to consider more candidate documents.

Markus Muhr, Roman Kern, Mario Zechner, and Michael Granitzer [19] presented a hybrid system for the PAN challenge at CLEF 2010. Their system performs plagiarism detection for translated and non-translated, externally as well as intrinsically plagiarized document passages; the external plagiarism detection approach is formulated as an information retrieval problem, with heuristic post-processing used to arrive at the final detection results.

Zechner et al. [20] employed a standard textual IR model for candidate selection. The source documents were indexed at the sentence level, and the sentences of suspicious documents were used as queries. Similarity was calculated with the well-established cosine measure. The mapping from sentences of the suspicious document to sentences of the source documents also provided the alignment of similar subsequences.

Lastly, and most important for our purposes, is the use of clustering in plagiarism detection:

Document clustering has been demonstrated to be a successful way of improving the performance of several tasks in information retrieval, such as document retrieval, text summarization, and results presentation [8]. The main problem with applying clustering techniques in real retrieval systems is the computational cost: clustering algorithms traditionally have high computational complexity in terms of both space and time [21].

The basic idea of the fingerprint-based approach is to create a kind of fingerprint for every document in the collection. Each fingerprint may contain several numerical attributes that somehow reflect the structure of the document; for example, the system can store the average number of words per line, the number of lines, the number of passages, the number of unique words, and so on. If two fingerprints are close to each other (according to a distance function), the documents themselves can also be considered similar (a minimal sketch of such a numeric fingerprint follows the examples below). Winnowing was presented with the objective of plagiarism detection, but its fingerprint construction also guarantees a set of theoretical properties in terms of fingerprint density and substring-match detection. Examples of this approach include:

Du Zou, Wei-jiang Long, and Zhang Ling [9] proposed a cluster-based plagiarism detection method that uses the grammar-based method (Winnowing's fingerprint extraction algorithm) and is divided into three steps: the first step, called pre-selecting, narrows the scope of detection using runs of identical fingerprints; the second, called locating, finds and merges all matching fragments between two documents using a clustering method; the third, called post-processing, deals with merging errors.

In "Evaluation of Text Clustering Algorithms with N-Gram-Based Document Fingerprints", two traditional clustering algorithms were implemented with a document representation based on winnowing fingerprints; the similarity measures were adapted to work with multi-sets, and a new way of computing centroids was designed. The performance of winnowing fingerprints was compared against term-frequency and mutual-information representations, using four different metrics and three different collections. The achieved results suggest that the presented approach should be further evaluated in tasks like cluster-based retrieval or clustering of web results.
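As promised above, here is a minimal sketch of a numeric document fingerprint together with a distance function; the particular attributes and the choice of Euclidean distance are our assumptions.

```python
# A numeric fingerprint of a document's structure, plus a distance function.
import math

def fingerprint(text):
    lines = text.splitlines() or [""]
    words = text.split()
    return (
        len(words) / len(lines),          # average words per line
        len(lines),                       # number of lines
        len({w.lower() for w in words}),  # number of unique words
    )

def fingerprint_distance(fp1, fp2):
    """Euclidean distance; small values suggest structurally similar documents."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp1, fp2)))
```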

CONCLUSION AND FURTHER WORK

Plagiarism is difficult to detect with 100% accuracy using current methods, so it will continue to rise. According to the above survey, each method has advantages and disadvantages. Most of them use clustering as a sorting and summarization technique, and according to the latest research it is advisable to use cluster-based retrieval or clustering to achieve better results.

Given the limitations of the grammar-based and semantics-based methods, we suggest combining the semantics-based method with the cluster-based method, as this should achieve much better results.

For further work, we are going to compare the recent software used for plagiarism detection against the above-mentioned methods according to: 1- Supported languages, 2- Extendibility, 3- Presentation of results, 4- Usability, 5- Exclusion of template code, 6- Exclusion of small files, 7- Historical comparisons, 8- Submission- or file-based rating, 9- Local or web-based, 10- Open source.

References

Bao Jun-Peng, Shen Jun-Yi, Liu Xiao-Dong, Song Qin-Bao, "A Survey on Natural Language Text Copy Detection", Journal of Software, vol. 14, no. 10, pp. 1753-1760, 2003 (in Chinese).

Wang Tao, Fan Xiao-Zhong, Liu Jie, "Plagiarism Detection in Chinese Based on Chunk and Paragraph Weight", in Proceedings of the Seventh International Conference on Machine Learning and Cybernetics, Kunming, pp. 2574-2579, July 2008.