Open Archives Data Service Prototype and Automated Subject Indexing Using D-Lib Archive Content As a Testbed (10 Dec 2003)
The results of computing keywords from the article text are quite interesting. We assigned a binary relevance value to each term in the keyword list based on a fairly relaxed standard. The question we posed was not whether the keywords were the best terms to represent the meaning of the article. Our criterion was instead whether the term was a reasonable keyword for the article or a part of the article. Under that relaxed criterion, the share of keywords judged relevant varied from 70% to 95%.
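To make the scoring concrete, here is a minimal sketch of how a relevance rate could be computed from binary judgments. The function name and the sample judgments are hypothetical illustrations, not data from the study.

```python
def relevance_rate(judgments):
    """Fraction of keywords judged relevant (1 = relevant, 0 = not relevant)."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(judgments) / len(judgments)


# Hypothetical judgments for one article's keyword list, for illustration only.
sample_judgments = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(f"{relevance_rate(sample_judgments):.0%} of keywords judged relevant")
```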
There are some nonsense terms in the keyword set, but many terms do reflect the semantic meaning of the article. One interesting observation is that the keyword set for each article has little in common with the keyword sets of other articles. That was a bit of a surprise given the tight focus of D-Lib. Another interesting feature of the relevance analysis is that most of the non-relevant keywords appeared at the top of each list. The keyword list is sorted by tf/idf weight, with the highest values at the top. This result suggests that tf/idf weighting may not be the best method of selecting terms. The tf/idf algorithm does a good job of selecting terms for a specific document, but a term that scores highly for one paragraph is not necessarily a good term for all paragraphs in the article. Keep in mind that we split each article into multiple documents (paragraphs) for this computation.
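The paragraph-level weighting described above can be sketched as follows. This is an illustration under stated assumptions, not the prototype's actual code: the tokenizer, the tf and idf formulas (relative term frequency times the natural log of inverse document frequency), and the choice to keep each term's maximum paragraph weight when building the article-level list are all assumptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def paragraph_tfidf(paragraphs):
    """Treat each paragraph as a document; return one {term: weight} dict per paragraph."""
    token_lists = [tokenize(p) for p in paragraphs]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per paragraph
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def article_keywords(paragraphs, top_n=5):
    """Rank terms by their highest paragraph-level weight (assumed aggregation)."""
    best = {}
    for paragraph_weights in paragraph_tfidf(paragraphs):
        for term, weight in paragraph_weights.items():
            best[term] = max(weight, best.get(term, 0.0))
    # Highest weight first; ties are broken alphabetically for a stable order.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical paragraphs, for illustration only.
example = [
    "the harvester reads metadata",
    "the indexer assigns keywords",
    "the archive stores articles",
]
for term, weight in article_keywords(example, top_n=3):
    print(term, round(weight, 3))
```

Under this sketch, a term that is frequent in a single paragraph and rare elsewhere rises to the top of the list even if it says little about the article as a whole, which is one plausible way non-relevant terms could end up ranked first.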
Article URL: http://www.dlib.org/dlib/december03/mongin/12mongin.html