Classifying XML documents by using genre features.

Clark, Malcolm; Watt, Stuart

doi:10.1109/DEXA.2007.120

Authors

Malcolm Clark

Stuart Watt

Contributors

A.M. Tjoa
Editor

R.R. Wagner
Editor

Abstract

The categorization of documents is traditionally topic-based. This paper presents a complementary analysis of research and experiments on genre to show that encouraging results can be obtained by using genre structure (form) features. We conducted an experiment to assess the effectiveness of using extensible mark-up language (XML) tag information and part-of-speech (P-O-S) features for the classification of genres, testing the hypothesis that if a focus on genre can lead to high precision on normal textual documents, then good results can be achieved using XML tag information in addition to P-O-S information. The experiment was carried out on a subsection of the Initiative for the Evaluation of XML (INEX) 1.4 collection. The features were extracted and documents were classified using machine learning algorithms, which yielded encouraging results for logistic regression and neural networks. We propose that utilizing these features and training a classifier may benefit retrieval for most World Wide Web (WWW) technologies such as XML and extensible hypertext markup language (XHTML).
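
As an illustration of the XML tag information mentioned in the abstract, the following minimal sketch turns a document's element tags into relative-frequency features that a classifier such as logistic regression could consume. It is not the authors' feature extractor: the function name `tag_features`, the choice of relative frequencies, and the sample document with its tag names are all assumptions made for this example, and P-O-S features are omitted.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_features(xml_text: str) -> dict:
    """Return the relative frequency of each element tag in an XML document.

    Hypothetical helper for illustration; the paper's actual feature set
    is not specified in the abstract.
    """
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    counts = Counter(element.tag for element in root.iter())
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


# Illustrative sample only; these tag names are assumed, not taken from INEX.
sample = "<article><sec><p>One.</p><p>Two.</p></sec><bib/></article>"
features = tag_features(sample)
```

Each document then becomes a vector over the tag vocabulary, which could be concatenated with P-O-S counts before training.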

Citation

CLARK, M. and WATT, S. 2007. Classifying XML documents by using genre features. In Tjoa, A.M. and Wagner, R.R. (eds.) Proceedings of the 18th International workshop on database and expert systems applications (DEXA 2007), 3-7 September 2007, Regensburg, Germany. Los Alamitos: IEEE Computer Society [online], article number 4312894, pages 242-248. Available from: https://doi.org/10.1109/DEXA.2007.120

Presentation Conference Type	Conference Paper (published)
Conference Name	18th International workshop on database and expert systems applications (DEXA 2007)
Start Date	Sep 3, 2007
End Date	Sep 7, 2007
Acceptance Date	Sep 30, 2007
Online Publication Date	Sep 30, 2007
Publication Date	Sep 30, 2007
Deposit Date	Mar 11, 2015
Publicly Available Date	Mar 11, 2015
Print ISSN	1529-4188
Electronic ISSN	2378-3915
Publisher	IEEE Computer Society
Peer Reviewed	Peer Reviewed
Article Number	4312894
Pages	242-248
Series Title	Proceedings of the international workshop on database and expert systems applications
Series ISSN	2378-3915
ISBN	9780769529325
DOI	https://doi.org/10.1109/DEXA.2007.120
Keywords	XML; Testing; Feature extraction; Data mining; Machine learning algorithms; Logistics; Neural networks; Web sites; World Wide Web; Markup languages
Public URL	http://hdl.handle.net/10059/1157
Contract Date	Mar 11, 2015

Files

CLARK 2007 Classifying XML documents (263 Kb)
PDF

Publisher Licence URL
https://creativecommons.org/licenses/by-nc-nd/4.0/
