Wednesday, December 9, 2009

Urdu OCR - A Digital Dream

Optical Character Recognition (OCR) is an approach for recognizing isolated characters that requires relatively simple calculations yet still gives adequate results. For document image recognition, an additional step is required: detecting the lines of text and the possible set of characters within those lines. Numerous methods are available for character recognition, ranging from numerical and statistical approaches to AI-based approaches, in increasing order of recognition accuracy. None of these approaches achieves 100% recognition accuracy; even humans do not recognize characters with absolute accuracy. The main objective of recognition software is to spare its user the physically tiring and cumbersome work of typing a whole document. Error correction still rests with the user. Hence, a recognition accuracy of even about 90% gives very satisfactory results. Apart from all this, image quality also plays a very important role in recognition accuracy.

A research project named Urdu OCR - A Digital Dream, from Usman Institute of Technology, aims to fulfil this need. The team members of this project are Abdul Wahab (ME), Shuwair Sardar, and Muhammad Abdul Sammad Khan. The project was first prize winner of Combat 2008 (Software Competition - PAF Kiet) and Software Exhibition (Software Competition - SZABIST) and second prize winner in NED

Urdu OCR is being developed for the first time; no such product has been developed before. The need for this product lies in print media such as Urdu newspapers and magazines. It is useful for converting Urdu books into digital format, so that the large amount of useful heritage material in the Urdu language that is at risk of vanishing can be preserved digitally. It can also be used to produce electronic books and an online digital Urdu library.

Friday, November 13, 2009

How to conquer the problem of fragmented information and help visitors to find the appropriate information?

Introduction :

Web sites contain a constantly growing quantity of information within their pages. As the quantity of information increases, so does the complexity of the web site's structure. Accordingly, it has become difficult for visitors to find the information relevant to their needs. Numerous clustering methods have been proposed to group data in an effort to help visitors find relevant information. These clustering methods have typically focused on either the content or the context of the web pages.

In addition, the results returned by traditional search engines do not include pages that use synonymous terms in place of the terms in the search query. For example, searching for the word ‘car’ with a traditional search engine will not find web pages that always use the word ‘automobile’ instead. Also, web pages that represent the concept being searched for may not include the search terms. For example, if we use the search term ‘astronomy’, the search engine may not display web pages about the solar system because they don’t contain the word ‘astronomy’.
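The synonym problem described above can be illustrated with a small sketch. The page texts and the synonym table here are made-up examples, not data from any real search engine, and the function names are hypothetical.

```python
# Toy page collection; the names and texts are invented for illustration.
PAGES = {
    "page_a": "used car prices and reviews",
    "page_b": "automobile prices and reviews",
    "page_c": "the planets of the solar system",
}

# Hypothetical hand-written synonym table standing in for a real thesaurus.
SYNONYMS = {
    "car": {"car", "automobile"},
    "automobile": {"car", "automobile"},
}


def keyword_search(query, pages):
    """Return the pages whose text contains the literal query term."""
    return sorted(name for name, text in pages.items() if query in text.split())


def expanded_search(query, pages, synonyms):
    """Return the pages containing the query term or any listed synonym."""
    terms = synonyms.get(query, {query})
    return sorted(name for name, text in pages.items() if terms & set(text.split()))


print(keyword_search("car", PAGES))             # literal match only
print(expanded_search("car", PAGES, SYNONYMS))  # also matches 'automobile'
print(keyword_search("astronomy", PAGES))       # misses the solar system page
```

The literal search for ‘car’ finds only page_a, while the expanded search also finds page_b. The search for ‘astronomy’ finds nothing, even though page_c is about a related concept.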

Solution :

I believe that to overcome this problem we need to automate the organization of information found on the internet through the use of machine learning techniques. In particular, I believe that combining self-organizing maps (SOM) and WordNet will be very useful.

What is WordNet ?
WordNet separates words into the classes of nouns, verbs, adjectives, and adverbs. It tries to organize information according to the meanings of words rather than their forms. WordNet also includes the standard information found in dictionaries and thesauri. One of its best features is its information about the relationships among words; the most important of these here is the hypernym relation, which links a word to a more general term (for example, ‘motor vehicle’ is a hypernym of ‘car’).

What is SOM ?
Samuel Kaski says in his paper "Creating an order in digital libraries with Self Organizing
"The SOM is a general unsupervised tool for ordering high-dimensional statistical data so that alike inputs are in general mapped close to each other. To utilize the SOM on texts, a document might, for example, be represented as the histogram of its words. A more practical method is to first use the so called semantic SOM for word categorization. The semantic SOM organized the words into grammatical and semantic categories represented on a two-dimensional array. The relative similarity of the categories reflected in their distance relationships on the array. An extra benefit from the use of category histograms instead of simple word histograms is that the dimensionality of the input to the document map is reduced by an order of magnitude."

The input for the SOM will be generated with WordNet in the form of vectors of zeros and ones. The size of each input vector should be equal to the number of unique replacement terms. The output is usually a two-dimensional grid of nodes.
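As a rough illustration of this encoding, the sketch below turns a piece of text into a vector of zeros and ones with one position per unique replacement term. The word-to-replacement table is a hypothetical, hand-written stand-in for a real WordNet hypernym lookup, and the function names are assumptions made for this example.

```python
# Hypothetical word -> replacement term table; a real system would look
# these up in WordNet instead of hard-coding them.
REPLACEMENTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "planet": "celestial_body",
    "moon": "celestial_body",
    "telescope": "instrument",
}


def build_vocabulary(replacements):
    """Sorted list of unique replacement terms; its length is the vector size."""
    return sorted(set(replacements.values()))


def encode(text, replacements, vocabulary):
    """Return a 0/1 vector marking which replacement terms occur in the text."""
    present = {replacements[w] for w in text.lower().split() if w in replacements}
    return [1 if term in present else 0 for term in vocabulary]


vocabulary = build_vocabulary(REPLACEMENTS)
print(vocabulary)
print(encode("automobile and truck dealers", REPLACEMENTS, vocabulary))
print(encode("a telescope pointed at the moon", REPLACEMENTS, vocabulary))
```

Because ‘car’, ‘automobile’ and ‘truck’ all map to the same replacement term, pages that use different words for the same concept end up with the same vector component set to one.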

Algorithm :
Daniel X. Pape says :

The simplest description of the SOM algorithm is this:
The reference vectors contained in all the nodes are randomly initialized.
Step 1: An input vector is randomly selected from the input set.
Step 2: Using some metric (Euclidean, Cosine, etc) the input vector is compared to _every_ node's reference vector.
Step 3: The node whose reference vector is the best match (in that metric) is chosen as the winning node for that particular input vector.
Step 4: The neighbouring nodes (nodes which are topographically close in the array) to the winning node are then updated by a certain amount. This update simply changes the properties of the reference vectors by a small amount so that they are more similar to the input vector.
Step 5: Go to step 1.
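The steps above can be sketched in plain Python as follows. This is only a minimal sketch under several assumptions that are not part of the quoted description: squared Euclidean distance as the metric, a Gaussian neighbourhood around the winning node (which also updates the winner itself), a learning rate and radius that shrink linearly, and a fixed number of iterations instead of an endless loop. All function and parameter names are hypothetical.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_node(grid, vector):
    """Steps 2-3: compare the input to every node, return the winner's (row, col)."""
    return min(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: squared_distance(grid[rc[0]][rc[1]], vector),
    )


def train_som(inputs, rows=3, cols=3, steps=500, rate=0.5, radius=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Initialization: every node gets a random reference vector.
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for t in range(steps):
        decay = 1.0 - t / steps                 # assumed linear decay schedule
        vector = rng.choice(inputs)             # Step 1: pick a random input
        win_r, win_c = best_matching_node(grid, vector)  # Steps 2-3
        sigma = max(radius * decay, 0.1)
        for r in range(rows):                   # Step 4: update the neighbourhood
            for c in range(cols):
                grid_dist2 = (r - win_r) ** 2 + (c - win_c) ** 2
                influence = math.exp(-grid_dist2 / (2 * sigma ** 2))
                node = grid[r][c]
                for i in range(dim):
                    # Move the reference vector a little towards the input.
                    node[i] += rate * decay * influence * (vector[i] - node[i])
    return grid                                 # Step 5 is the loop itself


inputs = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
som = train_som(inputs)
for vector in inputs:
    print(vector, "->", best_matching_node(som, vector))
```

The printed pairs show which grid node wins for each input vector after training; similar inputs tend to land on the same or nearby nodes, although the exact positions depend on the random seed.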

Overall Process :

Using the result of step 3 (the winning node for the search query's vector), we can retrieve the web sites that correspond to the search query.
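One possible reading of this lookup is sketched below: each page is stored under the node that wins for its vector, and a query returns the pages stored under the query's own winning node. The reference vectors, page names and vector layout are made up for illustration and stand in for a map that has already been trained; the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def winning_node(nodes, vector):
    """Return the key of the node whose reference vector is closest to the vector."""
    return min(nodes, key=lambda key: squared_distance(nodes[key], vector))


def index_pages(nodes, page_vectors):
    """Group page names by the node that wins for each page's vector."""
    index = {}
    for name, vector in page_vectors.items():
        index.setdefault(winning_node(nodes, vector), []).append(name)
    return index


def search(nodes, index, query_vector):
    """Return the pages that share the query's winning node."""
    return sorted(index.get(winning_node(nodes, query_vector), []))


# Made-up reference vectors standing in for a trained 1x2 map; the three
# components are assumed to mean (celestial_body, instrument, vehicle).
NODES = {(0, 0): [0.1, 0.1, 0.9], (0, 1): [0.9, 0.6, 0.1]}
PAGE_VECTORS = {
    "car_dealer": [0, 0, 1],
    "automobile_club": [0, 0, 1],
    "solar_system": [1, 0, 0],
    "telescope_shop": [1, 1, 0],
}

index = index_pages(NODES, PAGE_VECTORS)
print(search(NODES, index, [0, 0, 1]))  # a query about vehicles
print(search(NODES, index, [1, 0, 0]))  # a query about astronomy
```

With these made-up values, the vehicle query returns both the car and the automobile pages, and the astronomy query returns the solar system and telescope pages, even though none of the pages needs to contain the literal search term.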