A team of Olivet Institute of Technology graduates worked with the Olivet University Research and Development Center to develop new text taxonomy and tagging technology that would help online newspapers categorize their content more efficiently.
Traditionally, news editors must manually choose or write tags (keywords) for each article, attach those tags to the article’s metadata, and then place the article in the appropriate categories.
“It is tremendous work,” said a member of the team. “We saw the burden on editors, who often have hundreds of pieces to review and want to publish them as soon as possible.”
With assistance from OU’s R&D Center, the OIT graduates developed a Named Entity Recognizer to extract entities such as people, organizations and locations from articles. The team also researched Natural Language Processing and machine learning algorithms.
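To give a rough sense of what entity extraction for tagging involves, here is a minimal Python sketch. It is not the team's recognizer, which the article describes as a trained model. It swaps in a plain lookup table (a gazetteer) as a stand-in, and the names `GAZETTEER`, `extract_entities` and `suggest_tags`, along with the sample entries, are hypothetical.

```python
import re
from collections import Counter

# Hypothetical gazetteer: a tiny hand-made lookup table used for illustration.
# A trained recognizer would learn to spot entities instead of listing them.
GAZETTEER = {
    "Olivet University": "ORGANIZATION",
    "San Francisco": "LOCATION",
    "Jane Doe": "PERSON",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return one (name, label) pair per occurrence of a gazetteer name."""
    found = []
    for name, label in gazetteer.items():
        # Word boundaries keep "Jane Doe" from matching inside "Jane Doerr".
        pattern = r"\b" + re.escape(name) + r"\b"
        for _ in re.finditer(pattern, text):
            found.append((name, label))
    return found


def suggest_tags(text, limit=5):
    """Suggest up to `limit` tags, most frequently mentioned entities first."""
    counts = Counter(extract_entities(text))
    return [name for (name, _label), _count in counts.most_common(limit)]


if __name__ == "__main__":
    article = (
        "Jane Doe spoke at Olivet University. "
        "Olivet University is in San Francisco."
    )
    print(suggest_tags(article))
```

An editor could review the suggested tags instead of writing them from scratch, which is the kind of time saving the team describes.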
“We studied hundreds of thousands of news articles on the Web,” said another member of the team. “Thanks to many open source projects and data such as NLTK, Stanford NER, and Linked Open Data, we can train our identifier and examine the results.”
More than 20,000 articles and 10,000 topics and tags were included in the training dataset.
The technology is still in its beta stage.