Custom Dictionaries And High Dictionary Compression Computer Science Essay

Published: November 9, 2015 Words: 1493

Suggester Spell Check is a 100% pure Java library that provides a local spell-checking service. It is free to use with the pre-compiled dictionaries, and it uses the Basic Suggester Engine as its spellchecker.

Suggester Spell Check provides recommendations for unknown words in user queries to local search systems. A system administrator can create a list of preferred words and assign a higher weight to them. In its basic implementation, Suggester serves as a spellchecker, in which case all words have the same weight. The term "basic implementation" should not mislead: it includes a high-speed suggestion engine based on a fast edit-distance calculation algorithm, enhanced with Lawrence Philips' Metaphone algorithm and a private fuzzy-matching algorithm.

The Suggester uses not only the shortest edit-distance measure but also the Metaphone algorithm and, on top of these, a private fuzzy-matching algorithm to place the closest replacement at the top of the suggestion list. You can adjust the influence of each algorithm in the configuration file.
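The edit-distance part of this ranking can be illustrated with a minimal sketch. It is not the Suggester's engine: the Metaphone stage, the private fuzzy-matching stage and the configurable weights are left out, and the class EditDistanceSketch with its methods is hypothetical. It simply orders dictionary words by plain Levenshtein distance from the unknown word:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of the Suggester library.
class EditDistanceSketch {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into ""
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns up to 'limit' dictionary words, closest first;
    // words at the same distance are ordered alphabetically.
    static List<String> suggest(final String word, List<String> dictionary, int limit) {
        List<String> candidates = new ArrayList<String>(dictionary);
        Collections.sort(candidates, new Comparator<String>() {
            public int compare(String x, String y) {
                int dx = distance(word, x);
                int dy = distance(word, y);
                return dx != dy ? dx - dy : x.compareTo(y);
            }
        });
        return new ArrayList<String>(candidates.subList(0, Math.min(limit, candidates.size())));
    }
}
```

Scanning the whole dictionary for every unknown word, as this sketch does, would not reach the look-up speeds quoted for the real engine; the sketch only shows what "closest replacement first" means for a single measure.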

Local service:

Unlike Google's Spelling API, the Suggester library and a dictionary file are all you need to run a local spell-checking service fully under your control. There is no need to worry about exposure on the Internet, connectivity problems, or the availability of an external service.

Custom dictionaries:

The Index Builder allows users to create custom dictionaries from their own word lists. It can also be used to extract all words from a dictionary and to modify existing dictionaries.

High dictionary compression:

The word dictionary is compressed not only on the hard drive but also in memory. The basic UK English dictionary contains about 57,000 words and is about 90 KB in size. The full English dictionary contains about 200,000 words (including names, abbreviations, geographic places, etc.); it takes a 236 KB file on the hard drive and about 2 MB of memory.

High dictionary search and suggestion selection speed:

Case-dependent / case-independent dictionary look-up takes about 0.002 / 0.005 ms per word, which corresponds to about 500,000 / 200,000 words per second. Suggestion search averages about 40 ms per set of suggestions for each unknown word on a Pentium M 1.4 GHz (with high suggestion quality).

Portability:

The Suggester software is written entirely in Java 1.2. It runs on any Java® platform: Windows®, Mac OS®, Unix, Linux. It has been tested on JRE 1.2, 1.3, 1.4 and 1.5.

Requirements

A Java 1.2 or later compatible virtual machine for your operating system.

There are no other special requirements to run Suggester as a spellchecker.

To run the Index Builder you may need up to 512 MB (or more) of virtual memory.

Explanation of coding

----------------------------

A simple demo of how to use BasicSuggester, which relies on Configuration and Dictionary objects:

    // load English dictionary from jar file
    BasicDictionary dictionary = new BasicDictionary("file://english.jar");

    // load basic suggester configuration from file
    BasicSuggesterConfiguration configuration =
            new BasicSuggesterConfiguration("file://basicSuggester.config");

    // create Suggester based on configuration and attach dictionary
    BasicSuggester suggester = new BasicSuggester(configuration);
    suggester.attach(dictionary);

    // get and display up to 10 suggestions
    ArrayList suggestions = suggester.getSuggestions(word, 10);
    for (int j = 0; j < suggestions.size(); j++)
    {
        Suggestion suggestion = (Suggestion) suggestions.get(j);
        System.out.println("suggestion " + (j + 1) + ": " + suggestion.getWord());
    }
    System.out.println("\nTotal found: " + suggestions.size());

Apache Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.

Features

Lucene offers powerful features through a simple API:

Scalable, High-Performance Indexing

- over 20MB/minute on Pentium M 1.5GHz
- small RAM requirements -- only 1MB heap
- incremental indexing as fast as batch indexing
- index size roughly 20-30% the size of text indexed

Powerful, Accurate and Efficient Search Algorithms

- ranked searching -- best results returned first
- many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
- fielded searching (e.g., title, author, contents)
- date-range searching
- sorting by any field
- multiple-index searching with merged results
- allows simultaneous update and searching

Cross-Platform Solution

- Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
- 100%-pure Java
- Implementations in other programming languages available that are index-compatible

Explanation of coding

----------------------------

The first step in any Lucene application involves indexing your data. Lucene needs to create its own set of indexes, using your data, so it can perform high-performance full-text searching, filtering, and sorting operations on your data.

This is a fairly straightforward process. First of all, you need to create an IndexWriter object, which you use to create the Lucene index and write it to disk. Lucene is very flexible, and there are many options. Here, we will limit ourselves to creating a simple index structure in the "index" directory:

    Directory directory = FSDirectory.getDirectory("index", true);
    Analyzer analyser = new StandardAnalyzer();
    IndexWriter writer = new IndexWriter(directory, analyser, true);

Next, you need to index your data records. Each of your records needs to be indexed individually. When you index records in Lucene, you create a Document object for each record. For full-text indexing to work, you need to give Lucene some data that it can index. The simplest option is to write a method that writes a full-text description of your record (including everything you may wish to search on) and use this value as a searchable field. Here, we call this field "description."

You index a field by adding a new instance of the Field class to your document, as shown here:

    Document doc = new Document();
    Field field = new Field("description", value, Field.Store.NO, Field.Index.TOKENIZED);
    doc.add(field);
    writer.addDocument(doc);

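Field.Index.TOKENIZED runs the value through the analyzer before indexing it. Field.Index.UN_TOKENIZED is useful if you want to index a field without analyzing it first. If you simply wish to store the value for future use (for example, an internal identifier) without making it searchable, you can use Field.Index.NO together with Field.Store.YES.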
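To make the idea of an index over a "description" field more concrete, here is a minimal sketch of an inverted index in plain Java. It is an illustration under simplifying assumptions, not Lucene's implementation: the class InvertedIndexSketch is hypothetical, its tokenizer is a crude stand-in for an analyzer, there is no ranking, and a query matches only records that contain all of its terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; Lucene's real index is far more elaborate.
class InvertedIndexSketch {

    // term -> sorted ids of the records whose description contains the term
    private final Map<String, TreeSet<Integer>> postings =
            new HashMap<String, TreeSet<Integer>>();

    // Stand-in for an analyzer: lower-case the text and split it
    // on anything that is not an ASCII letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Indexes one record: every term of the description points back to the id.
    void add(int id, String description) {
        for (String term : tokenize(description)) {
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(id);
        }
    }

    // Returns the ids of the records that contain every term of the query,
    // in ascending order. An empty query matches nothing.
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                return new ArrayList<Integer>(); // unknown term: no match
            }
            if (result == null) {
                result = new TreeSet<Integer>(ids);
            } else {
                result.retainAll(ids); // intersect with the previous terms
            }
        }
        return result == null ? new ArrayList<Integer>() : new ArrayList<Integer>(result);
    }
}
```

The point of the structure is that a search never scans the records themselves: it looks up each query term in the map and intersects the id sets, which is why the index has to be built before any search can run.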

Full-text searches

Full-text searching in Lucene is relatively easy. A typical Lucene full-text search is shown here:

    Searcher is = indexer.getIndexSearcher();
    QueryParser parser = indexer.getQueryParser("description");
    Query query = parser.parse("Some full-text search terms");
    Hits hits = is.search(query);

Here, we use the indexer to perform a full-text search on the description field. Lucene returns a Hits object, which we can use to obtain the matching documents, as shown here:

    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        // get() returns the stored string value of the field (null if the field was not stored)
        String title = doc.get("title");
        System.out.println(title);
    }

Multi-criteria searches

Extending this code to implement multi-criteria searches requires a bit more work. The key class we use here is the Filter class, which, as the name indicates, lets you filter search results.

The Filter class is actually an abstract class. There are several types of filter classes that let you define precise filtering operations.

The QueryFilter class lets you filter search results based on a Lucene query expression. Here, we build a filter, limiting search results to books, using the type field:

    Query booksQuery = new TermQuery(new Term("type", Item.BOOK));
    Filter typeFilter = new QueryFilter(booksQuery);

The RangeFilter lets you limit search results to a range of values. The following filter limits search results to items dated between 1990 and 1999 inclusive, using the year field (the last two boolean arguments indicate whether the lower and upper limit values are inclusive):

Filter rangeFilter = new RangeFilter("year","1990", "1999", true, true);
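Note that the limits are passed as strings. In the Lucene versions that offer this RangeFilter constructor, terms are compared as text rather than as numbers. The following sketch shows the comparison with the two inclusive flags; the class StringRangeSketch and the method inRange are hypothetical names, not Lucene API:

```java
// Hypothetical illustration of an inclusive/exclusive range test on strings.
class StringRangeSketch {

    // True if value lies between lower and upper when compared as text.
    static boolean inRange(String value, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int toLower = value.compareTo(lower);
        int toUpper = value.compareTo(upper);
        boolean aboveLower = includeLower ? toLower >= 0 : toLower > 0;
        boolean belowUpper = includeUpper ? toUpper <= 0 : toUpper < 0;
        return aboveLower && belowUpper;
    }
}
```

With this comparison "1995" falls inside "1990".."1999", but so does "19955", because the strings are compared character by character. Values indexed for range filtering should therefore all have the same width (four-digit years, zero-padded numbers).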

The ChainedFilter lets you combine other filters using logical operators such as AND, OR, XOR, or ANDNOT. In the following example, we limit search results to only the documents matching both of the above conditions:

    // ChainedFilter takes an array of filters, not a List
    Filter[] filters = new Filter[] { typeFilter, rangeFilter };
    Filter filter = new ChainedFilter(filters, ChainedFilter.AND);

You can either apply the same operator to all filters or provide an array of operators, which lets you provide different operators to be used between each filter.
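What combining filters with logical operators amounts to can be sketched with java.util.BitSet, where bit n stands for "document n passes the filter". This is an illustration only, not the ChainedFilter source: the class FilterChainSketch, its constants and its left-to-right evaluation are assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical illustration of chaining filters with logical operators.
class FilterChainSketch {
    static final int OR = 0;
    static final int AND = 1;
    static final int ANDNOT = 2;
    static final int XOR = 3;

    // Builds a bit set with the given document numbers switched on.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    // Folds the filters left to right; logic[i - 1] is the operator
    // applied between the running result and filters[i].
    static BitSet chain(BitSet[] filters, int[] logic) {
        BitSet result = (BitSet) filters[0].clone(); // leave the inputs untouched
        for (int i = 1; i < filters.length; i++) {
            switch (logic[i - 1]) {
                case AND:    result.and(filters[i]);    break;
                case OR:     result.or(filters[i]);     break;
                case XOR:    result.xor(filters[i]);    break;
                case ANDNOT: result.andNot(filters[i]); break;
                default:
                    throw new IllegalArgumentException("Unknown operator: " + logic[i - 1]);
            }
        }
        return result;
    }

    // Convenience variant: the same operator between all filters.
    static BitSet chain(BitSet[] filters, int operator) {
        int[] logic = new int[Math.max(0, filters.length - 1)];
        java.util.Arrays.fill(logic, operator);
        return chain(filters, logic);
    }
}
```

For example, if the type filter accepts documents 0, 1 and 2 and the range filter accepts documents 1, 2 and 3, chaining them with AND leaves documents 1 and 2, while ANDNOT leaves only document 0.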

Documentation Links:

------------------------------

http://lucene.apache.org/java/3_0_1/api/all/index.html

http://en.wikipedia.org/wiki/Lucene

http://darksleep.com/lucene/

http://www.javaworld.com/javaworld/jw-09-2006/jw-0925-lucene.html

http://lucene.apache.org/java/3_0_1/gettingstarted.html

http://en.wikipedia.org/wiki/Spell_checker

http://www.softcorporation.com/products/spellcheck/

http://zzihee.net/toy/suggester/

http://www.devdaily.com/java/jwarehouse/jazzy/src/com/swabunga/spell/event/SpellChecker.java.shtml

Issues Faced.

1. Cannot save index to 'index' directory, please delete it first

Explanation: The first run of the application creates a folder named index and writes the files used for indexing the search data into it. The folder is not deleted when the application exits, so the next time the application starts this error occurs and the application terminates.

Solution: Delete the index folder at startup, before the application checks whether it exists.
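A minimal sketch of such a clean-up step is given below. The class IndexDirCleanup and the method deleteRecursively are hypothetical names; the project's actual code is not reproduced here. A directory has to be emptied before File.delete() can remove it, hence the recursion:

```java
import java.io.File;

// Hypothetical helper: removes a stale index folder before a new run.
class IndexDirCleanup {

    // Deletes a file, or a directory together with everything inside it.
    // Returns true if nothing is left at that path afterwards.
    static boolean deleteRecursively(File path) {
        File[] children = path.listFiles(); // null for plain files and missing paths
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        return !path.exists() || path.delete();
    }
}
```

At startup the application could call IndexDirCleanup.deleteRecursively(new File("index")) before creating the index writer, so that a folder left over from the previous run no longer triggers the error.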

2. The search functionality expects the contents to be searched in separate files, but our contents are saved in a single file, the DB file.

Solution: All article headers are read from the DB through the Cursor class. A temporary file is then created for each header, with the prefix tmp_ and the extension .tmp, containing the header of the article. These files are stored in a folder named tmp, which is passed to the content search.
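The export step can be sketched as follows. Reading the headers from the DB is left out; the class HeaderFileExport, the method export and the running number between the prefix and the extension are assumptions made for the example, since the report does not say how the individual files are named.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

// Hypothetical helper: writes one temporary file per article header.
class HeaderFileExport {

    // Writes each header to tmpDir/tmp_<n>.tmp (n = 0, 1, 2, ...)
    // and returns the number of files written.
    static int export(List<String> headers, File tmpDir) throws IOException {
        if (!tmpDir.isDirectory() && !tmpDir.mkdirs()) {
            throw new IOException("Cannot create folder " + tmpDir);
        }
        int count = 0;
        for (String header : headers) {
            File file = new File(tmpDir, "tmp_" + count + ".tmp");
            Files.write(file.toPath(), header.getBytes(StandardCharsets.UTF_8));
            count++;
        }
        return count;
    }
}
```

The folder passed as tmpDir is then the one handed to the content search. Because the files are regenerated from the DB, the folder can be deleted again once the index has been built.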

3. In the spell check UI, a null string value is set in the txtChangeTo text field after spell checking completes.

Solution: The click event handler of the Replace button now checks the return value of the hasMisspelt() method. If it returns false, the dialog is disposed.

NB. The following requirements were successfully completed in this phase:

Spell Checker

Advanced Search

Enhancement of UI

Thanks and looking forward.