A review of an article : “An approach to determining semantic similarity (ADSS)” : Advances in Engineering Software 37 (2006) 129–132
Enabling machines to extract the meaning of the content of documents on the web is crucial to the improvement of search engines. The better a machine can make connections between documents the easier it is to make connections between related documents, for example to cluster search results. For example a search for “Tolerance”, might well result in clustered results with the social, immunological, engineering, and legal meanings of the term separated to easily enable users to home in on information of interest. Common current technologies to enable such behaviour reply on manual tagging, or special mark-up of information. The work under review presents a method for determining if documents are linked in terms of their meaning. The method is an application of the Tabu search method which has been augmented by additional algorithms.
Tabu search
Tabu search is a mathematical optimization method, belonging to the class of local search techniques. Tabu search enhances the performance of a local search method by using memory structures. Tabu search is generally attributed to Fred Glover.
Tabu search uses a local or neighbourhood search procedure to iteratively move from a solution x to a solution x’ in the neighbourhood of x, until some stopping criterion has been satisfied. To explore regions of the search space that would be left unexplored by the local search procedure and—by doing this—escape local optimality, tabu search modifies the neighbourhood structure of each solution as the search progresses. The solutions admitted to N * (x), the new neighbourhood, are determined through the use of special memory structures. The search now progresses by iteratively moving from a solution x to a solution x’ in N * (x).
Perhaps the most important type of short-term memory to determine the solutions in N * (x), also the one that gives its name to tabu search, is the use of a tabu list. In its simplest form, a tabu list contains the solutions that have been visited in the recent past (less than n moves ago, where n is the tabu tenure). Solutions in the tabu list are excluded from N * (x). Other tabu list structures prohibit solutions that have certain attributes (e.g. traveling salesman problem (TSP) solutions that include certain arcs) or prevent certain moves (e.g. an arc that was added to a TSP tour cannot be removed in the next n moves). Selected attributes in solutions recently visited are labelled tabu-active. Solutions that contain tabu-active elements are tabu. This type of short-term memory is also called recency-based memory.
Tabu lists containing attributes are much more effective, although they raise a new problem. With forbidding an attribute as tabu, typically more than one solution is declared as tabu. Some of these solutions that must now be avoided might be of excellent quality and have not yet been visited. To overcome this problem, aspiration criteria are introduced which allow overriding the tabu state of a solution and thus include it in the allowed set. A commonly used aspiration criterion is to allow solutions which are better than the currently best known solution.
The above discussion of Tabu Search is licensed under the GNU Free Documentation License Source: http://en.wikipedia.org/wiki/Tabu_search.
The ADSS Approach
The ITSTDSS (Introduce Tabu Search to Determining Semantic Similarity) algorithm is proposed in the paper under review to determine the semantic similarity between documents. The algorithm described and showed to function with reasonable precision (better than an alternative algorithm used as a benchmark). The testing is done with only 50 documents which are to be assigned to a set of families (ontologies). The kind of approach described here is applicable for ranking documents in terms of how well they fit in a certain family - the approach would have a potential dual role in future search engine technology both in identifying families/clusters/ontologies as well as assigning rankings within them.
The computational power required to adopt such approaches on a large commercial search engine is a potential hurdle to adoption. As documents on the web are increasingly having semantic mark up applied either automatically by those involved indexing or during document creation where there the best opportunity to accurately and authoritatively tag documents as members of certain families is found.
Sci7’s web optimisation advice is based on observation of current and emerging trends, as well as published and internal research.