Malcolm Clark
Classifying XML documents by using genre features.
Clark, Malcolm; Watt, Stuart
Authors
Stuart Watt
Contributors
A.M. Tjoa
Editor
R.R. Wagner
Editor
Abstract
The categorization of documents is traditionally topic-based. This paper presents a complementary analysis of research and experiments on genre to show that encouraging results can be obtained by using genre structure (form) features. We conducted an experiment to assess the effectiveness of using extensible mark-up language (XML) tag information, and part-of-speech (P-O-S) features, for the classification of genres, testing the hypothesis that if a focus on genre can lead to high precision on normal textual documents, then good results can be achieved using XML tag information in addition to P-O-S information. An experiment was carried out on a subsection of the initiative for the evaluation of XML (INEX) 1.4 collection. The features were extracted and documents were classified using machine learning algorithms, which yielded encouraging results for logistic regression and neural networks. We propose that utilizing these features and training a classifier may benefit retrieval for most world wide web (WWW) technologies such as XML and extensible hypertext markup language (XHTML).
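As a rough illustration of what "XML tag information" could look like as classifier input, the sketch below turns one XML document into a vector of relative tag frequencies. It is not the authors' feature extractor: the function name `tag_features`, the toy document, and the tag vocabulary are all assumptions made for this example, and the part-of-speech features mentioned in the abstract are left out because they would need an external tagger.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_features(xml_text, vocabulary):
    """Return the relative frequency of each vocabulary tag in one XML document.

    Hypothetical helper for illustration; the paper does not specify this design.
    """
    root = ET.fromstring(xml_text)
    # Count every element in the tree, including the root element itself.
    counts = Counter(element.tag for element in root.iter())
    total = sum(counts.values())
    # Tags absent from the document get a frequency of 0.
    return [counts[tag] / total for tag in vocabulary]


# Toy document and tag vocabulary (assumptions, not taken from the INEX collection).
sample = "<article><sec><p>One</p><p>Two</p></sec><bib/></article>"
vocabulary = ["article", "sec", "p", "bib", "fig"]
features = tag_features(sample, vocabulary)
```

Vectors of this kind, one per document, could then be passed to any standard classifier such as logistic regression, alongside whatever part-of-speech features are available.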
Citation
CLARK, M. and WATT, S. 2007. Classifying XML documents by using genre features. In Tjoa, A.M. and Wagner, R.R. (eds.) Proceedings of the 18th International workshop on database and expert systems applications (DEXA 2007), 3-7 September 2007, Regensburg, Germany. Los Alamitos: IEEE Computer Society [online], article number 4312894, pages 242-248. Available from: https://doi.org/10.1109/DEXA.2007.120
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | 18th International workshop on database and expert systems applications (DEXA 2007) |
Start Date | Sep 3, 2007 |
End Date | Sep 7, 2007 |
Acceptance Date | Sep 30, 2007 |
Online Publication Date | Sep 30, 2007 |
Publication Date | Sep 30, 2007 |
Deposit Date | Mar 11, 2015 |
Publicly Available Date | Mar 11, 2015 |
Print ISSN | 1529-4188 |
Electronic ISSN | 2378-3915 |
Publisher | IEEE Computer Society |
Peer Reviewed | Peer Reviewed |
Article Number | 4312894 |
Pages | 242-248 |
Series Title | Proceedings of the international workshop on database and expert systems applications |
Series ISSN | 2378-3915 |
ISBN | 9780769529325 |
DOI | https://doi.org/10.1109/DEXA.2007.120 |
Keywords | XML; Testing; Feature extraction; Data mining; Machine learning algorithms; Logistics; Neural networks; Web sites; World Wide Web; Markup languages |
Public URL | http://hdl.handle.net/10059/1157 |
Contract Date | Mar 11, 2015 |
Files
CLARK 2007 Classifying XML documents
(263 Kb)
PDF
Publisher Licence URL
https://creativecommons.org/licenses/by-nc-nd/4.0/