WIT Press


Extracting People’s Names From RSS Feeds Using WordNet

Price

Free (open access)

Volume

33

Pages

10

Published

2004

Size

240 kb

Paper DOI

10.2495/DATA040041

Copyright

WIT Press

Author(s)

K. Durant & M. Smith

Abstract

Proper names are rich in content, and identifying them successfully is essential to representing the news accurately. Identifying people’s names is essential for representing an article and the event it reports, for identifying the actors and the receivers of action in the story, and for identifying trends within the news. Because names play such an important role, a name extraction algorithm must be accurate and precise. It must also be efficient, because news corpora are large and the data is time-sensitive. Typical name extraction systems use supervised learning to identify names. We take a different approach to name identification. We use WordNet to identify the popular names within the corpora. The algorithm tracks the unidentified words and uses standard templates to identify potential names among them. We also address the problem of names that are common words found within WordNet. We have created four gazetteers of words that may be a first name, last name, title or suffix of a name. We use these lists, along with the surrounding text and simple templates, to identify names. The algorithm can run at the same time as the words are mapped to the terms within the corpora. Our corpora are RSS news feeds, in particular the item element, which is a semantic representation of a news article. We identify the names found within the title and description elements of an RSS item, exploiting the fact that the two elements share the same topic for a given news article. Our algorithm has achieved a recall rate of 96% and a precision rate of 91%. We believe our approach performs well on this corpus because of the simple vocabulary and succinct writing found in newspaper headlines and lede statements.

K. Durant & M. Smith, Department of Engineering and Applied Sciences, Harvard University, USA
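The gazetteer-and-template step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four word lists are tiny hypothetical samples, the `KNOWN_WORDS` set is an assumed stand-in for a WordNet lookup, and the single template used here (optional title, optional first name, last name or unknown capitalized word, optional suffix) is only one plausible reading of "simple templates".

```python
import re

# Hypothetical miniature gazetteers; the paper's actual lists are not shown here.
TITLES = {"mr", "mrs", "ms", "dr", "president", "senator"}
FIRST_NAMES = {"john", "mary", "george", "martha"}
LAST_NAMES = {"smith", "bush", "kerry", "stewart"}
SUFFIXES = {"jr", "sr", "ii", "iii"}

# Assumed stand-in for a WordNet lookup: words treated as known vocabulary.
# "bush" is included on purpose: it is both a common word and a surname.
KNOWN_WORDS = {"the", "a", "in", "on", "with", "new", "factory", "visits",
               "opens", "talks", "trade", "speech", "ohio", "bush",
               "president", "senator"}


def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def is_cap(token):
    return token[:1].isupper()


def extract_names(text):
    """Apply the template [title] [first] last [suffix] to capitalized tokens."""
    tokens = tokenize(text)
    names = []
    i = 0
    while i < len(tokens):
        start = i
        parts = []
        # Optional title from the title gazetteer.
        if is_cap(tokens[i]) and tokens[i].lower() in TITLES:
            parts.append(tokens[i])
            i += 1
        # Optional first name from the first-name gazetteer.
        if i < len(tokens) and is_cap(tokens[i]) and tokens[i].lower() in FIRST_NAMES:
            parts.append(tokens[i])
            i += 1
        # Last name: listed in the gazetteer, or capitalized and unknown.
        matched = False
        if parts and i < len(tokens) and is_cap(tokens[i]):
            word = tokens[i].lower()
            if word in LAST_NAMES or word not in KNOWN_WORDS:
                parts.append(tokens[i])
                i += 1
                matched = True
        if not matched:
            i = start + 1  # template failed; restart at the next token
            continue
        # Optional suffix from the suffix gazetteer.
        if i < len(tokens) and tokens[i].lower() in SUFFIXES:
            parts.append(tokens[i])
            i += 1
        names.append(" ".join(parts))
    return names


if __name__ == "__main__":
    print(extract_names("President Bush visits new factory in Ohio"))
    print(extract_names("Martha Stewart talks with John Kerry Jr. on trade"))
```

In this sketch a candidate is accepted only when a title or first name precedes it, which is how the surrounding text disambiguates a surname such as "Bush" that is also an ordinary dictionary word.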

Keywords

proper nouns, WordNet, RSS, gazetteer, template description