INTINN: Automatic XML markup of text documents

 

Funded by:
Enterprise Ireland


Project Leaders:
John Dunnion, Joe Carthy, Dr. Nick Kushmerick (Dept of Computer Science, University College Dublin)


Principal Researcher:
Shazia Akhtar


Supervisors:
Prof. Ronan Reilly (Dept of Computer Science, NUI Maynooth)
John Dunnion (Dept of Computer Science, University College Dublin)


Description:
The XML markup system is fully automatic and is inspired by the WEBSOM method and the C5/See5 machine learning algorithm. Using the WEBSOM method, the system clusters marked-up documents so that semantically similar documents lie close together on a Self-Organizing Map (SOM). It then employs the inductive learning algorithm C5/See5 to learn markup rules from the nearest SOM neighbours of an unmarked document and applies those rules to that document. The system also learns from its own markup errors in order to improve accuracy. The automatically marked-up documents it produces are themselves placed on the Self-Organizing Map, which supports better management and retrieval.
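
The neighbour-selection step described above might be sketched roughly as follows. This is an illustrative toy, not the project's implementation: it assumes plain bag-of-words vectors and a small rectangular map rather than the full WEBSOM encoding, and every function name, parameter and default (bag_of_words, train_som, best_matching_unit, nearest_marked_documents, the 3x3 grid, the learning-rate schedule) is hypothetical. The C5/See5 rule-learning step is not reproduced; the sketch only selects the marked-up documents from which such rules would be learned.

```python
import math
import random


def bag_of_words(text, vocab):
    """Encode a document as term counts over a fixed vocabulary (assumed encoding)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]


def best_matching_unit(weights, vec):
    """Return the (row, col) of the map node whose weight vector is closest to vec."""
    best, best_dist = None, math.inf
    for pos, w in weights.items():
        dist = sum((a - b) ** 2 for a, b in zip(w, vec))
        if dist < best_dist:
            best, best_dist = pos, dist
    return best


def train_som(vectors, rows=3, cols=3, epochs=50, seed=0):
    """Train a small SOM; grid size and schedules are illustrative assumptions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs            # decays from 1 towards 0
        lr = 0.5 * frac                        # learning rate
        radius = max(rows, cols) / 2 * frac + 0.5  # neighbourhood radius, never 0
        for vec in vectors:
            br, bc = best_matching_unit(weights, vec)
            for (r, c), w in weights.items():
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                for i in range(dim):
                    # Pull the node towards the document, weighted by map distance.
                    w[i] += lr * influence * (vec[i] - w[i])
    return weights


def nearest_marked_documents(weights, marked, query_vec, k=2):
    """Return names of the k marked-up documents closest to the query on the map.

    marked is a list of (name, vector) pairs; closeness is measured between
    best-matching units on the grid, not in the original vector space.
    """
    qr, qc = best_matching_unit(weights, query_vec)

    def grid_distance(item):
        r, c = best_matching_unit(weights, item[1])
        return (r - qr) ** 2 + (c - qc) ** 2

    return [name for name, _ in sorted(marked, key=grid_distance)[:k]]


# Hypothetical usage: three already marked-up documents and one unmarked one.
vocab = ["xml", "markup", "map", "cluster"]
marked = [
    ("doc_a", bag_of_words("xml markup xml", vocab)),
    ("doc_b", bag_of_words("map cluster map", vocab)),
    ("doc_c", bag_of_words("xml markup", vocab)),
]
som = train_som([vec for _, vec in marked])
unmarked = bag_of_words("markup xml", vocab)
print(nearest_marked_documents(som, marked, unmarked, k=2))
```

In a fuller pipeline, the documents returned by nearest_marked_documents would serve as training examples for a rule learner such as C5/See5, and the learned rules would then be applied to the unmarked document.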