WIT Press


Multi-media Based Web Mining For An Information Resource

Price

Free (open access)

Volume

28

Pages

Published

2002

Size

941 kb

Paper DOI

10.2495/DATA020861

Copyright

WIT Press

Author(s)

H S Tan & S E George

Abstract

In this paper we examine multi-media based web mining with a view to establishing a web-based information resource that can retrieve documents based upon the multi-media features they contain. A multi-media based ‘search engine’ could potentially far exceed what can be achieved with a purely textual one, but before such a comprehensive engine can be developed, a data mining process is required to extract appropriate features from unstructured web documents. Firstly, this paper discusses the need for visualization of search results and describes user interface work aimed at improving on the linear list commonly used to present search results. Secondly, the principles of the self-organising feature map (SOM) are described, including the method of training, followed by examples of how the SOM can be used for visualization and how it has been used in various search-related applications. The SOM provides a topological ordering of the clustering. Thirdly, we present how the SOM has been used to present the results of data organisation using its clustering ability, operating on appropriate features extracted from the web pages. These features, once extracted from documents, can then be used to ‘index’ each document within the multi-media search space.

1 Introduction

This paper presents a novel visualization method for web-based search in the medical domain, and uses an artificial neural network to do so. It builds upon existing work in visualization for retrieval tasks, which has so far been confined to documents of a specific genre. In particular, the genre of research publication has been utilised, where the document has a well-defined structure (that importantly includes title and abstract). Typically, keywords from the title or

Keywords