Friday, November 13, 2009

How to conquer the problem of fragmented information and help visitors to find the appropriate information?

Introduction :

Web sites hold a constantly growing quantity of information within their pages. As the quantity of information increases, so does the complexity of the site's structure. As a result, it has become difficult for visitors to find the information relevant to their needs. Numerous clustering methods have been proposed to group web pages and help visitors find the relevant information. These methods have typically focused on either the content or the context of the web pages.

In addition, the results returned by traditional search engines do not include pages that use synonyms in place of the terms in the search query. For example, searching for the word ‘car’ with a traditional search engine will not find web pages that always use the word ‘automobile’ instead. Likewise, web pages that cover the concept being sought may not contain the search terms at all. For example, a search for ‘astronomy’ will not return web pages about the solar system if they don’t contain the word ‘astronomy’.

Solution :

I believe that to overcome this problem we need to automate the organization of the information found on the internet using machine learning techniques. In particular, I believe that combining self-organizing maps (SOM) with WordNet will be very useful.

What is WordNet ?
WordNet separates words into the classes of nouns, verbs, adjectives, and adverbs. It organizes words according to their meanings rather than their forms. WordNet includes the standard information found in dictionaries and thesauri. One of its best features is the information it holds about the relationships among words; the most important of these is the hypernym relation, which links a word to a more general term (for example, ‘car’ to ‘motor vehicle’).
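
To make the hypernym idea concrete, here is a minimal Python sketch. It does not call WordNet itself; the HYPERNYMS table and the function names are hypothetical stand-ins that I made up for illustration. It only shows how replacing words with a more general term lets ‘car’ and ‘automobile’ end up as the same term.

```python
# Toy hypernym table: a hypothetical stand-in for WordNet, not its real data or API.
HYPERNYMS = {
    "car": "motor vehicle",
    "automobile": "motor vehicle",
    "truck": "motor vehicle",
    "planet": "celestial body",
}

def replace_with_hypernym(word):
    """Return the more general term for a word, or the lowercased word if unknown."""
    key = word.lower()
    return HYPERNYMS.get(key, key)

def normalize(text):
    """Map every whitespace-separated word in a text to its replacement term."""
    return [replace_with_hypernym(w) for w in text.split()]
```

With this table, normalize("Car road") and normalize("automobile road") give the same list of terms, which is the effect we want from the synonym handling.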

What is SOM ?
Samuel Kaski says in his paper "Creating an order in digital libraries with Self Organizing Map":
"The SOM is a general unsupervised tool for ordering high-dimensional statistical data so that alike inputs are in general mapped close to each other. To utilize the SOM on texts, a document might, for example, be represented as the histogram of its words. A more practical method is to first use the so called semantic SOM for word categorization. The semantic SOM organized the words into grammatical and semantic categories represented on a two-dimensional array. The relative similarity of the categories is reflected in their distance relationships on the array. An extra benefit from the use of category histograms instead of simple word histograms is that the dimensionality of the input to the document map is reduced by an order of magnitude."

The input to the SOM will be generated with the help of WordNet. Each input is a vector of zeros and ones, and the length of the vector should be equal to the number of unique replacement terms. The output is usually a two-dimensional grid of nodes.
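
As a rough sketch of what such input vectors could look like, the Python below builds one binary vector per document. The function names and the sample documents are my own assumptions, and the documents are assumed to be already reduced to their replacement terms.

```python
def build_vocabulary(documents):
    """Collect the unique replacement terms across all documents, in sorted order."""
    return sorted({term for doc in documents for term in doc})

def to_binary_vector(doc_terms, vocabulary):
    """Put 1 where the document contains the term and 0 where it does not."""
    present = set(doc_terms)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical documents, already reduced to replacement terms.
docs = [
    ["motor vehicle", "engine"],
    ["celestial body", "orbit"],
    ["motor vehicle", "road"],
]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(d, vocab) for d in docs]
```

Every vector has the same length as the vocabulary, so all documents can be fed to the same map.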

Algorithm :
Daniel X. Pape says :

The simplest description of the SOM algorithm is this:
The reference vectors contained in all the nodes are randomly initialized.
Step 1: An input vector is randomly selected from the input set.
Step 2: Using some metric (Euclidean, Cosine, etc) the input vector is compared to _every_ node's reference vector.
Step 3: The node whose reference vector is the best match (in that metric) is chosen as the winning node for that particular input vector.
Step 4: The neighbouring nodes (nodes which are topographically close in the array) to the winning node are then updated by a certain amount. This update simply changes the properties of the reference vectors by a small amount so that they are more similar to the input vector.
Step 5: Go to step 1.
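
The steps above can be sketched in plain Python roughly as follows. This is only an illustration under my own assumptions: the grid size, the fixed learning rate, the fixed square neighbourhood and the function names are all made up here, and practical SOM implementations normally shrink the learning rate and the neighbourhood over time.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matching_unit(grid, vector):
    """Steps 2-3: compare the input with every node, return the winner's (row, col)."""
    positions = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return min(positions, key=lambda rc: euclidean(grid[rc[0]][rc[1]], vector))

def train_som(inputs, rows=3, cols=3, iterations=200, rate=0.5, radius=1, seed=0):
    """Train a small SOM with a fixed rate and radius (a simplifying assumption)."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Random initialization of every node's reference vector.
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for _ in range(iterations):          # Step 5: the loop repeats steps 1-4
        vector = rng.choice(inputs)      # Step 1: pick a random input vector
        win_r, win_c = best_matching_unit(grid, vector)
        # Step 4: move the winner and its grid neighbours toward the input.
        for r in range(max(0, win_r - radius), min(rows, win_r + radius + 1)):
            for c in range(max(0, win_c - radius), min(cols, win_c + radius + 1)):
                node = grid[r][c]
                for i in range(dim):
                    node[i] += rate * (vector[i] - node[i])
    return grid
```

A call such as train_som([[1, 0, 0], [0, 1, 0], [0, 0, 1]]) returns the trained 3x3 grid, and best_matching_unit can then be used on that grid to find the winning node for any new vector.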

Overall Process :

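Once the map has been trained, the search query is converted into an input vector in the same way as the web pages. The winning node found for it in step 3 then gives us the web sites that match the search query.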
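
A small sketch of this lookup is shown below, assuming the map has already been trained. The reference vectors, the page names and the function names are hypothetical and chosen only to keep the example short.

```python
import math
from collections import defaultdict

def winning_node(nodes, vector):
    """Return the key of the node whose reference vector is closest (Euclidean)."""
    return min(nodes, key=lambda key: math.dist(nodes[key], vector))

# Hypothetical reference vectors of an already-trained 1x2 map over the
# vocabulary ["celestial body", "motor vehicle"].
nodes = {(0, 0): [0.9, 0.1], (0, 1): [0.1, 0.9]}

# Hypothetical pages with their binary input vectors.
pages = {
    "solar-system.html": [1, 0],
    "used-cars.html": [0, 1],
    "automobile-history.html": [0, 1],
}

# Index every page under the node that wins for its vector.
index = defaultdict(list)
for url, vec in pages.items():
    index[winning_node(nodes, vec)].append(url)

def search(query_vector):
    """Return the pages that share the query's winning node."""
    return index[winning_node(nodes, query_vector)]
```

Here a query for ‘car’, once it has been turned into the vector [0, 1], lands on the same node as both car pages, even though one of them only ever uses the word ‘automobile’.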