WIT Press


Study Of Category Score Algorithms For k-NN Classifier

Price

Free (open access)

Volume

28

Pages

Published

2002

Size

543 kb

Paper DOI

10.2495/DATA020361

Copyright

WIT Press

Author(s)

H Kou & G Gardarin

Abstract

Study of category score algorithms for k-NN classifier

H. Kou (1), G. Gardarin (1, 2)
(1) PRiSM Laboratory, University of Versailles Saint-Quentin, France
(2) e-xmlmedia Inc., France

Abstract

In the context of document categorization, this paper analyzes category score algorithms for the k-NN classifier found in the literature, including the majority voting algorithm (MVA) and the simple sum algorithm (SSA). MVA and SSA are the two most commonly used algorithms for estimating the score of candidate categories in k-NN classifier systems. Based on the hypothesis that exploiting the relation between documents and categories could improve system performance, we propose two new weighting algorithms for computing category scores: a concept-based weighting (CBW) score algorithm and a term independence-based weighting (IBW) score algorithm. Our experimental results confirm our hypothesis and show that, in terms of average precision, IBW and CBW are better than the other score algorithms, while SSA scores higher than MVA. According to macro-averaged F1, CBW performs best. The Rocchio-based algorithm (RBA) always performs worst.

1 Introduction

Document categorization is the procedure of assigning one or more predefined category labels to a free-text document. It is a useful component of various language processing applications. A primary application of text categorization is to assign subject categories to documents to support information retrieval. It can ease the organization of growing volumes of textual information, in particular Web pages and other electronic documents. Many document categorization algorithms have been widely investigated, including k-nearest neighbor (k-NN) algorithms (Yang [1], [13], [3]), naive Bayes algorithms [6], and Rocchio algorithms [14]. Among them, k-NN is one of the top-performing classifiers [12] [13] [2] and is comparable to the most effective support vector machine algorithm reported in [2].

The k-NN algorithm is based on the assumption that the classification of an instance is most similar to the classification of other nearby instances. The main idea of k-NN is to find the top k nearest neighbors and then estimate the score for every category by using the category membership
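
To make the two baseline score algorithms named in the abstract concrete, here is a minimal Python sketch. It rests on assumptions not stated in this excerpt: documents are sparse term-weight dictionaries, similarity is cosine, MVA counts the neighbors that belong to a category, and SSA sums the similarities of those neighbors. The function names and the toy data are hypothetical and are not taken from the paper; CBW, IBW, and RBA are not shown because the excerpt does not define them.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_neighbors(query, training, k):
    """Return the k most similar training docs as (similarity, categories)."""
    scored = [(cosine(query, vec), cats) for vec, cats in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def mva_scores(neighbors):
    """Majority voting (assumed form): one vote per neighbor in the category."""
    scores = defaultdict(float)
    for _, cats in neighbors:
        for cat in cats:
            scores[cat] += 1.0
    return dict(scores)


def ssa_scores(neighbors):
    """Simple sum (assumed form): add each neighbor's similarity to its categories."""
    scores = defaultdict(float)
    for sim, cats in neighbors:
        for cat in cats:
            scores[cat] += sim
    return dict(scores)


# Hypothetical toy collection: (term weights, set of category labels).
training = [
    ({"oil": 1.0, "price": 1.0}, {"crude"}),
    ({"price": 1.0, "tariff": 3.0}, {"trade"}),
    ({"price": 1.0, "export": 3.0}, {"trade"}),
    ({"wheat": 1.0}, {"grain"}),
]
query = {"oil": 1.0, "price": 1.0}

neighbors = top_k_neighbors(query, training, k=3)
mva = mva_scores(neighbors)
ssa = ssa_scores(neighbors)
print("MVA:", mva)
print("SSA:", ssa)
```

On this toy data the two scores rank the categories differently: two weakly similar neighbors outvote one near-identical neighbor under MVA, while SSA favors the category of the near-identical neighbor because it weights each vote by similarity.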

Keywords