Character Recognition Using Neural Networks Computer Science Essay

Published: November 9, 2015 Words: 2203

This paper briefly explains character recognition, the technique of translating images of pages captured with an optical scanner into a format that can be edited with computer editors. The methods used for character recognition and its applications are also discussed. Considerable research has been done on the English alphabet. Nowadays, OCR machines can recognize up to 2500 characters per minute.

Introduction

The need for character recognition software has grown along with the use of the internet. As the use of computers and the internet increased, so did the demand for online information. Digitizing all paper documents would be an enormous task if everything had to be typed by hand, and so research on character recognition began.

Character recognition is the process of scanning a document of typed (and sometimes handwritten) information and extracting the characters of the alphabet, thus making it possible to save the text in a digital, editable format. In other words, character recognition takes an image from a scanner and converts it back into words.

All training and testing inputs were in bitmap format (.bmp) because this is a very common way to save images that have been scanned. Optical character recognition (OCR), or simply character recognition, provides an easy way of entering data.

Working of OCR

There are basically two techniques:

Matrix matching

Feature extraction

The most widely used method is matrix matching. It is simpler and less complex than feature extraction.

In the matrix matching method, the scanned image is compared with a standard library of character matrices that is already stored. When the image matches one of these standard templates (matrices of dots) within a given limit of similarity, the software labels the image with the corresponding ASCII character.
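The matrix matching idea can be illustrated with a minimal Python sketch. The tiny 3x3 templates, the function names, and the similarity limit of 0.85 are assumptions made for illustration only; they are not taken from the project.

```python
# Minimal template-matching sketch (hypothetical names and data).
# Glyphs are small binary grids: 1 = white pixel, 0 = black pixel.

def similarity(a, b):
    """Fraction of pixel positions where two equally sized grids agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total

def match_character(glyph, templates, threshold=0.85):
    """Return the label of the best template, or None if below the limit."""
    best_label, best_score = None, 0.0
    for label, template in templates.items():
        score = similarity(glyph, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None

# Illustrative template library (an assumption, not a real OCR font).
TEMPLATES = {
    "I": [[1, 0, 1],
          [1, 0, 1],
          [1, 0, 1]],
    "L": [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]],
}

# An "I" with one pixel flipped by noise.
noisy_i = [[1, 0, 1],
           [1, 0, 1],
           [1, 0, 0]]

print(match_character(noisy_i, TEMPLATES))
```

A glyph that resembles no template closely enough is left unlabelled, which is the role of the limit of similarity mentioned above.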

The second method, feature extraction, is optical character recognition without strict matching to a prescribed library of templates. This method is also called Intelligent Character Recognition, and some refer to it as Topological Feature Analysis. Its accuracy depends on the level of complexity used by the developer. The software scans for features such as open areas, closed areas, diagonal lines, straight lines, line crossings, and other graphical features. This method is more flexible than matrix matching, but it is also more complex. Matrix matching works best when there are only a few type styles, little noise, and little or no variation in the characters. For handwriting, or when the characters come in many style variations, feature extraction is used.
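As an illustration of one such feature, the sketch below counts the closed areas of a glyph with a flood fill. It is only a sketch under stated assumptions: the function name, the 4-neighbour connectivity, and the sample glyphs are all hypothetical, and a real feature extractor would combine many more features.

```python
# Counting closed areas in a binary glyph (1 = white, 0 = black).
# Function name and sample glyphs are hypothetical.

def count_closed_areas(grid):
    """Count white regions that are fully enclosed by black pixels."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]

    def flood(r, c):
        # Mark every white pixel 4-connected to (r, c) as seen.
        stack = [(r, c)]
        seen[r][c] = True
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < rows and 0 <= nx < cols
                        and grid[ny][nx] == 1 and not seen[ny][nx]):
                    seen[ny][nx] = True
                    stack.append((ny, nx))

    # White regions touching the border are background, not closed areas.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and grid[r][c] == 1 and not seen[r][c]:
                flood(r, c)

    # Every remaining unseen white region is one closed area.
    closed = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and not seen[r][c]:
                closed += 1
                flood(r, c)
    return closed

LETTER_O = [[1, 1, 1, 1, 1],
            [1, 0, 0, 0, 1],
            [1, 0, 1, 0, 1],
            [1, 0, 0, 0, 1],
            [1, 1, 1, 1, 1]]

LETTER_L = [[1, 1, 1, 1, 1],
            [1, 0, 1, 1, 1],
            [1, 0, 1, 1, 1],
            [1, 0, 0, 0, 1],
            [1, 1, 1, 1, 1]]

print(count_closed_areas(LETTER_O), count_closed_areas(LETTER_L))
```

Because the count does not depend on where the glyph sits inside its window, a feature like this tolerates variation that strict template matching does not.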

Optical Character Recognition Fonts

A font is a set of characters, typically 0-9 and A to Z, together with special characters. The characters of a font share defined features and can be used at any size. For OCR use, there is a standard defined by ANSI. OCR uses fonts that can be recognized easily even by slow, inexpensive systems. These fonts are easy to read for both scanners and humans. Fonts that give greater recognition accuracy are used as OCR fonts.

A font in which all the characters effectively have the same width, regardless of the actual size of the letters, numbers, or symbols, can be used as an OCR font.

Pre-processing was needed to turn a scanned image into OCR-ready inputs. Pre-processing may include, but is not limited to, the following steps:

The image is reduced to black and white (a binary form).

The binary image is then converted to a matrix of ones and zeros, where ones indicate white pixels and zeros indicate black pixels.
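The two pre-processing steps can be sketched in a few lines of Python. The project itself used Matlab; the function name, the sample pixel values, and the fixed threshold of 128 are assumptions made for illustration.

```python
# Thresholding a grayscale image into a matrix of ones and zeros.
# Convention from the text: 1 = white pixel, 0 = black pixel.
# The threshold of 128 is an assumed value, not the project's setting.

def to_binary_matrix(gray, threshold=128):
    """Map grayscale pixels (0-255) to 1 (white) or 0 (black)."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in gray]

# Hypothetical 3x3 grayscale patch from a scanned page.
gray_patch = [[255, 250, 30],
              [240, 20, 10],
              [200, 90, 255]]

binary_patch = to_binary_matrix(gray_patch)
for row in binary_patch:
    print(row)
```

Pixels near the threshold are the ones that flip unpredictably between scans, which is one source of noise in the resulting matrix.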

OCR scanners

OCR scanners are the reading devices used to input documents to a computer. They are classified into two categories:

Text Input

Data Capture

Text input devices are used to read pages, scanned documents, large parts of documents, or even whole books. The source is scanned so that it can be edited afterwards. These devices offer various levels of automation, from manual feeding to automatic feeding, reading, sorting, and even stacking.

Data capture devices, the other category, are used to scan data that is repeated many times and to apply some pre-specified formatting to the scanned data as it is entered. The data delivered from the OCR scanner to the software should be accurate, because it is not intended to be edited or manually reworked later. These devices therefore require more accuracy than text input OCR scanners.

Preparation of Data

The first part of the project consisted of gathering sample data and targets with which to train the neural network. In this project, the 12 pt. Courier New font was used to generate the capital letters of the alphabet, and also an empty space. The character set in figure 1 was saved in .bmp format and given to the neural network to use for training.

ABCDEFGHIJKLMNOPQRSTUVWXYZ

Figure 1. Courier New Training Set

Each letter served as an input having 108 attributes. See figure 2 for a sample character from the Courier New font family having 12*9 attributes.

Figure 2. Courier New font Sample

A normalized vector from 1 to 27 defined the targets for each of the 27 inputs. Therefore, the output for the network would be a number between 0 and 1, with 27 possible values.
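One way to build such targets is sketched below in Python. The text does not state the exact normalization or where the space sits in the sequence, so dividing the index by 27 and placing the space last are assumptions, as are all the names.

```python
import string

# 26 capital letters plus the empty space = 27 characters.
# Placing the space last is an assumption.
CHARACTERS = list(string.ascii_uppercase) + [" "]

# Assumed normalization: the i-th character (1..27) gets target i / 27.
TARGETS = {ch: (i + 1) / len(CHARACTERS) for i, ch in enumerate(CHARACTERS)}

def decode(output):
    """Map a network output to the character with the nearest target."""
    return min(TARGETS, key=lambda ch: abs(TARGETS[ch] - output))

print(len(TARGETS), TARGETS["A"], TARGETS[" "])
print(decode(0.04), repr(decode(0.99)))
```

With a single scalar output, neighbouring letters have targets only 1/27 apart, so even a small output error decodes to a different character.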

Next, an ideal word was created and saved in bitmap format for testing the network, just to make sure Matlab was simulating the network correctly and that the network was at least working with the training data.

The word 'PERCEPTRON' was used for testing the network to make sure the training was successful. Figure 3 shows how the bitmap looked that Matlab received.

PERCEPTRON

Figure 3. Ideal test data

Then, non-ideal data from a scanner was used for testing the network. This non-ideal data was typed and printed out and then scanned back in to simulate the real-world process of scanning in a page of text. Figure 4 contains a close-up of a piece of scanned data.

Figure 4. Non-Ideal sample

After receiving a non-ideal input such as the one in figure 4, Matlab has to convert it to a black and white image. After conversion to a binary image, much information is lost and the letters also appear noisy. The scanned data looks like that in figure 5.

Figure 5. Non-Ideal black and white sample

Then, Matlab converts the black and white image to a matrix of ones and zeros. For example, the letter 'Q' can be spotted in the matrix after being converted:

1 1 1 0 0 0 1 1 1
1 1 0 1 1 1 0 1 1
1 0 1 1 1 1 1 0 1
1 0 1 1 1 1 1 0 1
1 0 1 1 1 1 1 0 1
1 0 1 1 1 1 1 0 1
1 0 1 1 1 1 1 0 1
1 0 1 1 1 1 1 0 1
1 1 0 1 1 1 0 1 1
1 1 1 0 0 0 1 1 1
1 1 1 1 0 0 1 1 1
1 1 0 0 1 1 0 0 1

Figure 6. Binary Matrix Representing Q
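The 12 x 9 matrix for 'Q' becomes a single input of 108 attributes once it is flattened. The Python sketch below shows this step; reading the matrix row by row is an assumption, since the text does not say in which order the pixels were arranged.

```python
# The 12 x 9 binary matrix for 'Q' from figure 6 (1 = white, 0 = black).
Q_MATRIX = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0, 1],
]

def flatten(matrix):
    """Concatenate the rows into one attribute vector (row-major order)."""
    return [pixel for row in matrix for pixel in row]

q_vector = flatten(Q_MATRIX)
print(len(q_vector))
```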

Architectures Tested

For all the architectures used, there were 27 input vectors each having 108 attributes.

Linear Associator With Pseudoinverse Rule

The first architecture that was used to attempt character recognition was the Linear Associator using the Pseudoinverse rule. The Pseudoinverse was used instead of the Hebb rule because the prototype patterns were not orthogonal. The Pseudoinverse rule was preferred over other learning rules because of its simplicity. The weight matrix for the linear associator using the Pseudoinverse rule can be found using the following matrix equation:

$W = T P^{+}$

where $P^{+}$ is the pseudoinverse, defined by $P^{+} = (P^{T} P)^{-1} P^{T}$.

After forming the input matrix P, and corresponding target matrix T, the weight matrix was easy to calculate. Because of the rule's simplicity, changing the weight matrix for a new set of fonts would be quick enough to do on-the-fly.
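The calculation can be sketched in plain Python as follows. This is not the project's Matlab code: the helper functions, the 4-attribute toy prototypes, and the target values are assumptions chosen to keep the example small. The formula for the pseudoinverse requires the columns of P (the prototypes) to be linearly independent.

```python
# Pseudoinverse rule: W = T * P+, with P+ = (P^T P)^-1 P^T.
# Columns of P are prototype patterns; T holds one target per prototype.

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: prototypes must be independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def pseudoinverse_weights(P, T):
    """Return W = T (P^T P)^-1 P^T."""
    Pt = transpose(P)
    P_plus = matmul(inverse(matmul(Pt, P)), Pt)
    return matmul(T, P_plus)

# Toy data (assumed): 3 prototypes with 4 attributes each, as columns.
P = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0],
     [0, 0, 1]]
T = [[1 / 3, 2 / 3, 1.0]]  # one scalar target per prototype

W = pseudoinverse_weights(P, T)
recalled = matmul(W, P)  # should reproduce T for the prototypes
print([round(v, 6) for v in recalled[0]])
```

Presenting the prototypes to the trained associator reproduces their targets, because $P^{+} P$ is the identity when the prototypes are linearly independent. Recomputing W for a new font only requires rebuilding P and repeating these matrix operations.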

The Linear Associator gave better results than any other network tested, so this was the one chosen in the final version of the project.

4-Layer Networks With Backpropagation Algorithm

Several different architectures were experimented with, starting with a 4-layer network having 12 neurons in the first 2 layers, 2 neurons in the 3rd layer, and 1 neuron in the 4th layer. With all the transfer functions as tangent sigmoid, the ideal data was loaded and the network converged to a minimum error after about 50 epochs. The network was tested with the ideal data, and found to properly identify the letters, but with the non-ideal data, the network could not identify any of the characters.

The network was probably over-learning the prototype data set, so the number of neurons in each layer was changed a couple of times. Even with mean squared errors (MSE) under 0.01, the network could not properly identify the non-ideal data.

5-Layer Network With Backpropagation Algorithm

Of the few 5-layer networks tested, the one with the best results had 2 neurons in the first layer and 5 neurons in the 2nd, 3rd, and 4th layers, and 1 neuron in the 5th layer. The tangent sigmoid function was used on the first 4 layers, and a pure linear function was used on the 5th layer.
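The shape of this network can be sketched as a forward pass in Python. The sketch covers only the architecture described above (108 inputs, layers of 2, 5, 5, 5, and 1 neurons, tangent sigmoid followed by a linear output). The random weights, their range, and the function names are assumptions, and backpropagation training is omitted.

```python
import math
import random

# 108 inputs, then the five layers described in the text.
LAYER_SIZES = [108, 2, 5, 5, 5, 1]

def init_network(sizes, seed=0):
    """Create untrained layers with small random weights (assumed range)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def forward(layers, inputs):
    """Tangent sigmoid on all layers except the last, which is linear."""
    activation = inputs
    last = len(layers) - 1
    for index, (weights, biases) in enumerate(layers):
        net = [sum(w * a for w, a in zip(row, activation)) + b
               for row, b in zip(weights, biases)]
        activation = net if index == last else [math.tanh(v) for v in net]
    return activation

network = init_network(LAYER_SIZES)
output = forward(network, [1.0] * 108)  # an all-white 12 x 9 input
print(len(network), len(output))
```

The first layer squeezes all 108 attributes through just 2 neurons, so every later layer works from a two-number summary of the character.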

Upon training, the network reached an MSE of virtually zero. When tested with non-ideal data, the performance was much better than with the 4-layer network, but still not as good as with the Linear Associator.

Results and Analysis

Using an ideal prototype data set, the best results for the 3 types of networks used are as follows:

Figure 7. %Accuracy Using Ideal Prototypes

Note: These percentages do not include the spaces in the sentences that each network easily recognized. If taken into account, these percentages would be much higher.

Since the accuracy was clearly too low, various measures were taken to try to improve performance. These included:

Using edge detection on non-ideal data

Using different schemes for the targets

Sorting the prototype letters by similar shape and size

None of these attempts noticeably affected the performance.

The main reason the performance was so low was a character-offset effect that occurred when Matlab reduced the scanned image to black and white. See figure 8. The middle image is the ideal prototype, centered about its 9-pixel width, and the outer two images are what the scanned character may look like. Even though all the characters are identical in shape, the offset makes it nearly impossible for the neural network to identify them correctly.
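A small Python sketch makes the size of this effect concrete. The 5 x 5 glyph and the helper names are hypothetical; the point is only that a one-pixel shift changes more pixel positions than the glyph has black pixels.

```python
# Effect of a one-pixel horizontal offset on a pixel-wise comparison.
# 1 = white, 0 = black; glyph and helper names are hypothetical.

def shift_right(grid, pixels, fill=1):
    """Shift every row right, padding the left edge with white pixels."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in grid]

def differing_pixels(a, b):
    """Number of positions where two equally sized grids disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

GLYPH = [[1, 0, 0, 0, 1],
         [1, 1, 0, 1, 1],
         [1, 1, 0, 1, 1],
         [1, 1, 0, 1, 1],
         [1, 1, 1, 1, 1]]

shifted = shift_right(GLYPH, 1)
black_pixels = sum(row.count(0) for row in GLYPH)
print(black_pixels, differing_pixels(GLYPH, shifted))
```

To a network that treats each pixel position as a separate attribute, the shifted glyph is therefore a very different input, even though a human sees the same character.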

Figure 8. Offset effect

Next, attempts to edit the prototype patterns were made because the prototype patterns should match (as well as possible) the non-ideal data that will be gathered.

For testing the effects, the Linear Associator was used because it had already been yielding better results than the other networks tested.

The first edit to the prototype patterns involved adding noise in places the scanned images looked noisy. For the scanned images in this project, most of the noise appeared at the top of the letters, so that is where the noise was added to the prototype patterns. This method increased the accuracy of the Linear Associator to 12%.
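The idea of adding noise near the top of a prototype can be sketched as follows. How the noise was actually added in the project is not described, so the number of noisy rows, the flip probability, and the function name are all assumptions.

```python
import random

# Flip pixels at random in the top rows of a prototype (1 = white, 0 = black).
# noisy_rows and flip_probability are assumed values for illustration.

def add_top_noise(grid, noisy_rows=2, flip_probability=0.2, seed=0):
    """Return a copy of grid with random pixel flips in its top rows."""
    rng = random.Random(seed)
    noisy = [row[:] for row in grid]
    for r in range(min(noisy_rows, len(noisy))):
        for c in range(len(noisy[r])):
            if rng.random() < flip_probability:
                noisy[r][c] = 1 - noisy[r][c]
    return noisy

PROTOTYPE = [[1, 0, 0, 0, 1],
             [1, 1, 0, 1, 1],
             [1, 1, 0, 1, 1],
             [1, 1, 0, 1, 1]]

noisy_prototype = add_top_noise(PROTOTYPE)
for row in noisy_prototype:
    print(row)
```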

Then, non-ideal prototype patterns were created using the same method by which the non-ideal data was gathered. This greatly improved the performance. The Linear Associator gave an optimum accuracy of 21%.

Figure 9. %Accuracy Using Different Prototypes

Advantages of OCR

There are various reasons for using OCR scanning rather than other methods of data entry, such as bar codes. The advantages include, but are not limited to:

Fewer data entry errors in comparison to manual entry

Consolidation of several data entry tasks in digitized form

Efficient handling of peak loads

Output in a human-readable and editable form

Output that can easily be printed again

Help with scanning corrections

In this project, various networks were trained to recognize characters of the alphabet from a scanned image. The Linear Associator performed the best and was also the simplest to implement.

After trying various methods to improve the performance, a character recognition accuracy of 21% was achieved when the prototype data was generated from the same source the test data was coming from. An accuracy of 21% means that out of every 100 letters, 21 will be correctly identified.
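The accuracy figure can be computed as in the Python sketch below. Following the note in the results, spaces are left out of the count. The function name and the sample strings are hypothetical, and the two strings are assumed to be aligned character by character.

```python
# Percentage of letters recognized correctly, ignoring spaces.
# Assumes expected and recognized are aligned character by character.

def letter_accuracy(expected, recognized):
    """Return the percentage of non-space characters that match."""
    pairs = [(e, r) for e, r in zip(expected, recognized) if e != " "]
    if not pairs:
        return 0.0
    correct = sum(e == r for e, r in pairs)
    return 100.0 * correct / len(pairs)

print(letter_accuracy("AB CD", "AX CD"))  # 3 of 4 letters correct
```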

This accuracy is still very low, so other approaches need to be explored for this type of character recognition, such as applying a more sophisticated edge detection algorithm, or using characteristic area ratios of the characters (for example, of black pixels to white pixels) to identify them.
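The area-ratio idea can be sketched in Python as follows. The glyphs and the function name are hypothetical. The example also shows that, as long as no black pixel falls outside the window, the ratio is the same for a glyph and its shifted copy. On its own the ratio cannot separate characters that use a similar amount of ink.

```python
# Ratio of black to white pixels as a simple feature (1 = white, 0 = black).

def black_white_ratio(grid):
    """Return (number of black pixels) / (number of white pixels)."""
    black = sum(row.count(0) for row in grid)
    white = sum(row.count(1) for row in grid)
    return black / white if white else float("inf")

# A hypothetical glyph and the same glyph shifted right by one pixel.
CENTERED = [[1, 0, 0, 0, 1],
            [1, 1, 0, 1, 1],
            [1, 1, 0, 1, 1],
            [1, 1, 0, 1, 1],
            [1, 1, 1, 1, 1]]

OFFSET = [[1, 1, 0, 0, 0],
          [1, 1, 1, 0, 1],
          [1, 1, 1, 0, 1],
          [1, 1, 1, 0, 1],
          [1, 1, 1, 1, 1]]

print(black_white_ratio(CENTERED) == black_white_ratio(OFFSET))
```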

Appendix - Matlab Code Explanation

An explanation of the provided Matlab files:

project.m - GUI for the character recognition project

getsamples.m - Gets prototype data in bitmap format

hebb.m - Simulation of Hebbian learning using Linear Associator

readline.m - Reads and simulates network on a line of image data (bitmap format)

projectresult.txt - File where resulting line of text is stored