Writer Identification: Current State of the Art

Published: November 21, 2015 Words: 3242

Writer identification has become increasingly important in recent years. It can be applied in a wide range of areas, such as digital rights management in the financial sphere and criminology, where forensic expert decision-making systems work from a narrowed-down list of candidate writers provided by the writer identification system. Combined with writer verification as an authentication system, it can be used to monitor and regulate access to confidential sites or data. Where large volumes of documents, forms, notes and meeting minutes are constantly being processed and managed, knowing the identity of the writer provides additional value. It can also be used for historical document analysis [1], handwriting recognition system enhancement [2], and hand-held and mobile devices [3]. To a certain extent, given its recent development and performance, it is considered comparable to strong physiological modalities of identification such as DNA and fingerprints [4]. As a result of these opportunities, the number of researchers working on this challenging problem is growing.

Handwriting-based writer identification is an active research area. It is one of the most difficult problems in computer vision and pattern recognition, and it involves a number of sub-problems: a) designing algorithms to identify the handwriting of different individuals, b) identifying relevant features of the handwriting, c) devising basic methods for representing the features, d) deriving complex features from the basic features, and e) evaluating the performance of automatic methods.

A comprehensive review of automatic writer identification up to 1989 is given in [5], and work from 1989 to 1993 is surveyed in [6] as an extension. The approaches proposed in the last several years have renewed interest in the topic within this scientific community. Figure 1 shows the standard framework of writer identification [7].

Fig. 1 Writer Identification framework [7]

Based on the input method, automated writer identification is classified into on-line and off-line. On-line writer identification is considered less difficult than off-line identification because it has access to more information about a person's writing style, such as speed, angle or pressure, which is not available off-line [8, 9]. It is further classified by the unit of writing from which features are taken: character, word, line, paragraph or document. Figure 2 shows the resulting taxonomy.

Fig. 2 Taxonomy of writer identification w.r.t features of the writing

Automated writer identification is also classified into text-dependent and text-independent methods. Text-dependent methods match the same characters and therefore require the writer to write the same text. They are thus very similar to signature verification techniques, comparing individual characters or words of known semantic (ASCII) content, and they achieve better identification results. Because they require the same writing content, however, they are not suited to many practical situations. Text-independent methods identify writers regardless of the text content and do not require comparison of the same characters; they use the global style of the handwriting as the metric for comparison. Although they have wider applicability, text-independent methods do not reach the same high accuracy as text-dependent methods.

The following sections describe the approaches proposed for writer identification in different languages.

1.1 Chinese, English and other languages

In the late 1990s, Said et al. [14], [15] proposed a text-independent approach for writer identification that derives writer-specific texture features using multichannel Gabor filtering [ref?] and gray-scale co-occurrence matrices [ref?]. The framework requires uniform blocks of text, which are generated by word deskewing, setting a predefined distance between text lines and words, and text padding. Two sets of twenty writers, with 25 samples per writer, were used in the experiments. Nearest-centroid classification using weighted Euclidean distance and Gabor features achieved 96 percent writer identification accuracy, showing that the two-dimensional Gabor model outperformed the gray-scale co-occurrence matrix. A similar approach has also been used on machine-printed documents for script [16] and font [17] identification.
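The nearest-centroid rule with a weighted Euclidean distance (WED) mentioned above can be illustrated with a minimal sketch. It assumes each writer is summarized by the per-feature mean and variance of the training vectors and that each squared deviation is divided by the variance; the exact weighting used in [14], [15] is not stated here, and the function names and toy feature values are hypothetical.

```python
from statistics import mean, pvariance

def fit_centroids(samples_by_writer):
    """Return {writer: (means, variances)}, computed per feature."""
    model = {}
    for writer, samples in samples_by_writer.items():
        columns = list(zip(*samples))  # one tuple of values per feature
        means = [mean(col) for col in columns]
        # A small floor avoids division by zero for constant features.
        variances = [max(pvariance(col), 1e-9) for col in columns]
        model[writer] = (means, variances)
    return model

def weighted_euclidean(x, means, variances):
    """Squared deviations from the centroid, each weighted by 1/variance (assumed weighting)."""
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, means, variances))

def identify(x, model):
    """Pick the writer whose centroid is closest under the weighted distance."""
    return min(model, key=lambda w: weighted_euclidean(x, *model[w]))

# Hypothetical two-dimensional texture features for two writers.
training = {
    "writer_a": [[0.9, 0.1], [1.1, 0.2], [1.0, 0.0]],
    "writer_b": [[3.0, 2.0], [3.2, 2.1], [2.8, 1.9]],
}
model = fit_centroids(training)
print(identify([1.05, 0.15], model))
```

Dividing by the variance lets stable features of a writer count more than features that fluctuate from sample to sample.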

In 2000, Zois and Anastassopoulos [18] performed writer identification and verification using single words. Experiments were performed on a data set of 50 writers. The word "characteristic" was written 45 times by each writer, both in English and in Greek. After image thresholding and curve thinning, the horizontal projection profiles were resampled, divided into 10 segments, and processed using morphological operators at two scales to obtain 20-dimensional feature vectors. Classification was performed using either a Bayesian classifier or a multilayer perceptron. The system showed an accuracy of 95% for both English and Greek words. In the writer identification scheme suggested by Marti et al. [30] and Hertel and Bunke [31], text lines were the basic input unit, from which text-independent features were computed using the height of the three main writing zones, slant and character width, the distances between connected components, the blobs enclosed inside ink loops, the upper/lower contours, and the thinned trace processed using dilation operations. Using a k-nearest-neighbour classifier, identification rates exceeded 92 percent in tests on a subset of the IAM database [33] with fifty writers and five handwritten pages per writer. The IAM data set will also be used in the current study.

Graham Leedham et al. proposed a methodology to identify the writer of numerals [43]. The features included parameters such as height, width, area, center of gravity, slant and number of loops. The system was tested on fifteen people with an accuracy of 95%; to determine its accuracy more precisely, it should be verified on larger databases. Srihari et al. [19], [20] proposed a large number of features, which can be classified into two categories. a) Macrofeatures operate at the document, paragraph or word level. The parameters used are gray-level entropy and threshold, number of ink pixels, number of interior/exterior contours, number of four-direction slope components, average height/slant, paragraph aspect ratio and indentation, word length, and upper/lower zone ratio. b) Microfeatures operate at the word or character level. They comprise gradient, structural, and concavity (GSC) attributes, which were originally used for handwritten digit recognition in [21]. Text-dependent statistical evaluations were performed on a data set of one thousand writers who each copied a fixed text of 156 words (the CEDAR letter) three times. This is the largest data set used so far in writer identification studies. Microfeatures outperformed macrofeatures in identification tests, with an accuracy exceeding 80 percent. A multilayer perceptron or parametric distributions were used for writer verification, with an accuracy of about 96 percent. Writer discrimination was also performed using individual characters in [22], [23] and using words in [24], [25].

Bensefia et al. [26], [27], [28], [29] use graphemes generated by a handwriting segmentation method to encode the individual characteristics of handwriting independently of the text content. Our allograph-level approach is similar to the work reported in these studies. Grapheme clustering was used to define a feature space common to all documents in the data set. Experiments were conducted on three data sets containing 88 writers, 39 writers (historical documents), and 150 writers, with two samples (text blocks) per writer. Writer identification was performed in an information retrieval framework, while writer verification was based on the mutual information between the grapheme distributions of the two handwriting samples being compared. Concatenations of graphemes are also analyzed in these papers. An accuracy of about 90 percent was reported on the different test data sets. A feature selection study is also reported in [32].
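To make the idea of a common grapheme feature space concrete, the following sketch maps each grapheme of a document to its nearest codebook entry and describes the document by the normalized histogram of those assignments. It is an illustration under stated assumptions, not the procedure of [26]-[29]: the codebook is taken as given (in practice it would come from clustering), graphemes are assumed to be already reduced to short numeric vectors, and all names and values are hypothetical.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def grapheme_histogram(graphemes, codebook):
    """Normalized count of how often each codebook entry is the nearest match."""
    counts = [0] * len(codebook)
    for g in graphemes:
        counts[nearest_index(g, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical codebook of three prototype shapes and four graphemes from one document.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
document = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]]
print(grapheme_histogram(document, codebook))
```

Two documents can then be compared through their histograms, whatever text they contain.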

In [26, 27], Ameur Bensefia et al. developed a probability-based approach using a codebook of graphemes on the IAM and PSI databases. The system accuracy was 95% on the IAM database and 86% on the PSI database. Laurens van der Maaten et al. used a combination of simple directional features and a codebook of graphemes [41]; the method was tested on 150 writers and reached 97% accuracy. Vladimir Pervouchine et al. focused only on the letters "t" and "h" in their English identification system. After these shapes are detected in the image, their skeletons are extracted. A cost function along the curve is then calculated, and the similarity of the cost functions identifies the writer [42]. This method cannot readily be extended to other languages, since it depends on these specific letter shapes. Schomaker et al. presented a method based on fragmented connected-component contours (FCO3) [35, 36]. They used the χ² distance in the classification phase and tested the method on an English data set with 150 writers. Top-1 accuracy was 72% and top-10 accuracy was 93%; the top-10 result was satisfactory, but the top-1 result was not.
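Since many of the results in this survey are quoted as top-1 and top-10 accuracy, a short sketch of how such figures are computed may help. It assumes that, for every query document, the system returns a list of candidate writers ordered from most to least similar; the function name and toy rankings are hypothetical.

```python
def top_k_accuracy(ranked_candidates, true_writers, k):
    """Fraction of queries whose true writer appears among the first k candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_candidates, true_writers)
        if truth in ranked[:k]
    )
    return hits / len(true_writers)

# Hypothetical ranked hit lists for four query documents.
rankings = [
    ["w1", "w2", "w3"],
    ["w3", "w2", "w1"],
    ["w2", "w1", "w3"],
    ["w1", "w3", "w2"],
]
truth = ["w1", "w2", "w3", "w3"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, truth, k))
```

Top-k accuracy can only grow with k, which is why a method can look satisfactory at top-10 and weak at top-1.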

Schlapbach et al. implemented an HMM-based writer identification and verification method [37, 38]. An individual HMM is designed and trained for each writer's handwriting. To determine who wrote an unknown text, the text is presented to all the HMMs, and the writer whose model yields the highest score is taken to be the author. The identification method was tested on documents gathered from 650 writers, with an identification accuracy of 97%. The method was also tested for writer verification, using a collection of writings from 100 people together with twenty unskilled and twenty skilled impostors who forged the originals. These experiments showed about 96% overall accuracy in verification. The method can therefore plausibly be extended to other languages by changing the feature extraction phase. The two writer identification schemes in [39] and [40] differ in that the former was applied to English handwriting and achieved about 80% top-1 and about 92% top-10 accuracy, while the latter supported Arabic handwriting and achieved 88% top-1 and 99% top-10 accuracy.
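The decision rule of one HMM per writer, with the highest-scoring model naming the author, can be sketched as follows. This is a simplification under stated assumptions: the models are small discrete HMMs with hand-picked parameters, whereas a real system would train each model on features extracted from the writer's text lines; training is omitted entirely, and every name and number below is hypothetical.

```python
import math

def log_likelihood(observations, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM; returns log P(observations | model)."""
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(observations)):
        if t > 0:
            # Propagate the state distribution and weight it by the emission probability.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states))
                * emit[s][observations[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    return log_prob

def identify(observations, writer_models):
    """Return the writer whose HMM gives the observation sequence the highest log-likelihood."""
    return max(writer_models, key=lambda w: log_likelihood(observations, *writer_models[w]))

# Hypothetical two-state models over a two-symbol alphabet: (start, transitions, emissions).
models = {
    "writer_a": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.8, 0.2], [0.7, 0.3]]),
    "writer_b": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.3, 0.7]]),
}
print(identify([0, 0, 1, 0], models))
```

Working with log-likelihoods and per-step rescaling keeps the scores comparable even for long observation sequences.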

In 2007, Vladimir Pervouchine et al. [34] implemented a writer identification scheme based on frequently occurring characters. In this method, the frequent characters ('f', 'd', 'y', 'th') are first identified, and the writer is then selected according to the similarity of those characters. The similarity is calculated with respect to features of the characters, such as height, width and slant. The number of features differs from character to character (e.g. 'f' had 7 features while 'th' had 10). A simple Manhattan distance was used in the classification phase. To select the best subset of the features, a GA was used, which evaluated about 5000 subsets out of 2^31 possible subsets. The system was tested on a database of 165 writers (between 15 and 30 patterns per writer), and the system accuracy was more than 95%. Although the method is simple and gives good results, its main weakness is that a writer who knows the procedure can, in the test phase, write text whose characters are entirely different from the trained ones, so that the system cannot identify him or her.
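The role of feature selection in a Manhattan-distance classifier can be shown with a small sketch. It assumes each character sample is a numeric feature vector and that a feature subset is simply a tuple of indices into that vector; the genetic search that would choose the subset is not shown, and the names and values are hypothetical.

```python
def manhattan(a, b, selected):
    """Sum of absolute differences over the selected feature indices only."""
    return sum(abs(a[i] - b[i]) for i in selected)

def identify(query, references, selected):
    """references is a list of (writer, feature_vector); return the nearest writer."""
    return min(references, key=lambda item: manhattan(query, item[1], selected))[0]

# Hypothetical (height, width, slant) measurements of one character per writer.
references = [("w1", [10.0, 4.0, 0.30]), ("w2", [12.0, 5.0, 0.90])]
query = [11.8, 4.1, 0.35]

print(identify(query, references, selected=(0, 1, 2)))  # all three features
print(identify(query, references, selected=(1, 2)))     # width and slant only
```

In this toy case the two subsets lead to different decisions, which is why the choice of subset matters and why an exhaustive search over all subsets is replaced by a heuristic one.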

A major contribution by Bangy Li et al. [10], also in 2007, used a feature vector with a hierarchical structure of shape primitives, along with dynamic and static features, for writer identification of 242 writers in the NLPR online database, attaining accuracy above 90% for Chinese and about 93% for English. The explanation given is that English text contains more orientation information than Chinese text. In 2008, Zhenyu He et al. suggested an off-line Chinese writer identification scheme that used a Gabor filter to extract features from the text and incorporated a Hidden Markov Tree (HMT) in the wavelet domain. The system was tested on a database containing 1000 documents written by 500 writers, with each sample containing 64 Chinese characters. The top-1, top-15, and top-30 results had 40%, 82.4%, and 100% accuracy, respectively [12]. The same authors also used a combination of a generalized Gaussian density (GGD) model and the wavelet transform on Chinese handwriting in [13]. They tested the method on a database gathered from 500 people, with 2 handwriting images per person. In these experiments, the top-1, top-15 and top-30 results had 39.2%, 84.8% and 100% accuracy, respectively. As the authors reported, the accuracy of the proposed methods was low, especially for top-1.

In 2009, YuChenYan et al. [11] presented a spectral feature extraction method based on the Fast Fourier Transform, which was tested on 200 Chinese handwritten texts collected from 100 writers. The method showed 98% accuracy for top-10 and 64% for top-1 using the Euclidean and WED classifiers. The scheme has the advantage of stable features, and it reduces the effect of randomness in Chinese characters. It is also feasible for large data sets, although it has higher computation costs.

1.2 Arabic

Bulacu et al. proposed text-independent Arabic writer identification by combining textural and allographic features [40, 45]. After the textural features were extracted (mostly relations between different angles at each written pixel), a probability distribution function was generated, and a nearest-neighbor classifier with the χ² distance measure was used. For the allographic features, a codebook of 400 allographs was generated from the handwriting of 61 writers, and the similarity of these allographs was used as another feature. The database used in the experiments consisted of 350 writers with 5 samples per writer, each sample consisting of 2 lines (about 9 words). The best accuracy seen in the experiments was 88% for top-1 and 99% for top-10. A simpler version of this method was presented earlier by M. Bulacu et al. in [46].
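A nearest-neighbor decision over feature distributions with a chi-square distance can be sketched as below. The sketch assumes each handwriting sample is already summarized as a discrete probability distribution (a normalized histogram) and uses the common symmetric form, the sum of (p - q)^2 / (p + q) over all bins; whether [40, 45] use exactly this normalization is an assumption, and the names and numbers are hypothetical.

```python
def chi_square(p, q):
    """Symmetric chi-square distance between two histograms; empty bins are skipped."""
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q) if pi + qi > 0)

def identify(query, known):
    """known maps writer -> reference distribution; return the closest writer."""
    return min(known, key=lambda w: chi_square(query, known[w]))

# Hypothetical three-bin direction distributions for two enrolled writers.
known = {
    "w1": [0.5, 0.3, 0.2],
    "w2": [0.1, 0.2, 0.7],
}
print(identify([0.4, 0.4, 0.2], known))
```

Dividing each squared difference by the bin mass gives sparsely populated bins more influence than they would have under a plain Euclidean distance.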

Ayman Al-Dmour et al. designed an Arabic writer identification system in 2007 [47]. Different feature extraction methods, namely hybrid spectral-statistical measures (SSMs), multiple-channel (Gabor) filters, and the grey-level co-occurrence matrix (GLCM), were evaluated to find the best subset of features. To that end, a support vector machine (SVM) was used to rank the features, and a GA whose fitness function was a linear discriminant classifier (LDC) then chose the best subset. Several classification methods were also considered: LDC, SVM, weighted Euclidean distance (WED), and K nearest neighbors (KNN). After feature selection, the KNN-5, WED, SVM, and LDC results per sub-image were reported as 57.0%, 47.0%, 69.0% and 90.0%, respectively. The results were better when the whole image was used; for instance, the LDC result increased to 100% (with no rotation). The test database was gathered from 20 writers; each writer was asked to copy 2 A4 documents, one for training and the other for testing. The documents differed from writer to writer, and the sub-images were generated by dividing each document into 3x3 = 9 non-overlapping images. Although this method has good accuracy with LDC, the test database and the number of samples per writer were small, and it needs to be tested on a more widely used data set. Faddaoui and Hamrouni opted for a set of 16 Gabor filters [48] for handwriting texture analysis. Gazzah and Ben Amara applied spatial-temporal textural analysis in the form of lifting scheme wavelet transforms. Angular features were considered as well in the task of Arabic writer identification [49].
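The division of a document image into 3x3 = 9 non-overlapping sub-images, as described for the experiments in [47], can be sketched as follows. The image is assumed to be a list of pixel rows, and any rows or columns left over when the size is not divisible by three are simply dropped; that choice, like the function name, is an assumption of this sketch.

```python
def split_into_blocks(image, rows=3, cols=3):
    """Split a 2-D list of pixels into rows*cols equal, non-overlapping blocks (row-major order)."""
    height, width = len(image), len(image[0])
    block_h, block_w = height // rows, width // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            block = [
                row[c * block_w:(c + 1) * block_w]
                for row in image[r * block_h:(r + 1) * block_h]
            ]
            blocks.append(block)
    return blocks

# A hypothetical 6x6 "image" whose pixel value encodes its position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
blocks = split_into_blocks(image)
print(len(blocks), blocks[0], blocks[8])
```

Each block can then be treated as a separate sample of the same writer.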

Somaya Al-Ma'adeed et al. presented a text-dependent writer identification method for Arabic using only 16 words [44]. The extracted features include structural features such as height, area and length, and three edge-direction distributions of different sizes; WED was used as the classifier. The test data was 32 000 Arabic text images from 100 people; the system was trained on 75% of the data and tested on the remaining 25%. The authors did not report top-1 accuracy, but the best top-10 result was 90% when 3 words were used. The main concerns with this method are its dependence on the text and the small data set used in the experiments. The method employed edge-based directional probability distributions, combined with moment invariants and structural word features such as area, length, height, length from baseline to upper edge and length from baseline to lower edge. Abdi et al., on the other hand, used stroke measurements of Arabic words, such as length, ratio and curvature, in the form of PDFs and a cross-correlation transform of features [50] for their writer identification scheme.

Although Arabic is similar to Persian in its character set and some writing styles, the Arabic methods cannot be carried over completely to Persian because of some special symbols that exist in Arabic.

1.3 Persian

In 2006, Shahabi et al. proposed a Gabor-based system for Persian writer identification and reported an accuracy of about 92% for top-3 and 88% for top-1 [51]. The testing was not adequate, however: in the test phase there was only one page per person, of which 3/4 was used for training and the rest for testing. To verify these results more generally, we implemented and tested their method using 5 pages per writer for training and a separate page for testing; the accuracy was 60% on 80 people. In another scheme, Soleymani Baghshah et al. designed a fuzzy approach for Persian writer identification [57]. In this proposal, fuzzy directional features were used, and a fuzzy learning vector quantization (FLVQ) network was trained to recognize the writers. The drawback of this method is that it only works on disjoint Persian characters, which are not conventional in Persian writing. The system was tested on 128 writers, and the results were around 90%-95% under different test conditions.

A Persian handwriting identification system based on a new generation of the Gabor filter, called the XGabor filter, was proposed in 2008 [52]. Feature extraction was done using Gabor and XGabor filters, and a weighted Euclidean distance (WED) classifier was used in the classification phase. To test the system, we assembled a data set of handwriting from 100 people, which has also been used in some other works. This data set is called PD100 and is referred to by that name in the present paper. The method proposed in [52] achieved 77% accuracy on PD100. Rafiee and Motavalli [58] introduced a new Persian writer identification method that uses baseline and width structural features and relies on a feed-forward neural network for classification.

In another recent work, we proposed an LCS (longest common subsequence) based classifier for features extracted by Gabor and XGabor filters [53, 54]. This classifier improved the system accuracy to 95% on PD100. Although the features extracted by the XGabor filter could model the characteristics of written documents, the accuracy of these methods was not satisfactory because of problems in data classification and representation. Therefore, in the present paper, we use the XGabor filter together with the Gabor filter, with different data representation, classification, and identification schemes. In other research, Sadeghi ram et al. used a mixture of different methods: grapheme-based features are clustered by a fuzzy clustering method and, after some clusters are selected, the final decision is made based on gradient features. The scheme achieved about 90% accuracy on average on 50 people selected randomly from PD100 [55]. They also used a three-layer MLP (multi-layer perceptron) to classify the gradient-based features and achieved about 94% average accuracy on the same data set [56]. To the best of our knowledge, no other method has been reported for Persian writer identification.
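The longest common subsequence that underlies the LCS-based classifier mentioned above can be computed with standard dynamic programming, as in the sketch below. How [53, 54] turn filter responses into sequences is not described here, so the sketch assumes the features have already been quantized into sequences of integer codes and scores a reference by LCS length divided by the longer sequence length; that normalization and all names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """LCS length normalized by the longer sequence, in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0

def identify(query, references):
    """references maps writer -> code sequence; return the most similar writer."""
    return max(references, key=lambda w: similarity(query, references[w]))

# Hypothetical quantized feature sequences for two enrolled writers.
references = {"w1": [1, 2, 3, 4, 5], "w2": [5, 4, 3, 2, 1]}
print(identify([1, 2, 4, 5], references))
```

Because a subsequence may skip elements, the measure tolerates a few missing or extra codes while still rewarding sequences that agree in order.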