Two-part segmentation of text documents.
Deepak, P.; Visweswariah, Karthik; Wiratunga, Nirmalie; Sani, Sadiq
Authors
P. Deepak
Karthik Visweswariah
Professor Nirmalie Wiratunga n.wiratunga@rgu.ac.uk
Associate Dean for Research
Sadiq Sani
Abstract
We consider the problem of segmenting text documents that have a two-part structure, such as a problem part and a solution part. Documents of this genre include incident reports, which typically describe events relating to a problem, followed by events pertaining to the solution that was tried. Segmenting such documents into their two component parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation because the two segments are lexically inter-related. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness, using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part, which comprises words sampled from the problem language model, followed by the solution part, whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms state-of-the-art text segmentation algorithms in segmentation accuracy, and that this improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise, and empirically illustrate that it is quite noise tolerant and degrades gracefully with increasing amounts of noise.
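The generative story in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes unigram language models stored as plain dictionaries, a fixed mixing weight between the solution language model and the translation model, a uniform average over problem words in the style of IBM Model 1, and a small probability floor for unseen words. The function name `best_split`, the parameters `mix` and `floor`, and the toy probabilities are all hypothetical.

```python
import math

def best_split(words, problem_lm, solution_lm, translation, mix=0.5, floor=1e-6):
    """Return (k, score): words[:k] is the problem part, words[k:] the solution part.

    Each candidate split is scored by the log-likelihood of the document under
    the two-part generative story; the highest-scoring split wins.
    """
    best_k, best_score = None, -math.inf
    for k in range(1, len(words)):
        problem, solution = words[:k], words[k:]
        # Problem words are drawn from the problem language model.
        score = sum(math.log(problem_lm.get(w, floor)) for w in problem)
        for s in solution:
            # Assumption: translation probability averaged uniformly over the
            # problem words, as in IBM Model 1 with uniform alignments.
            p_trans = sum(translation.get((p, s), 0.0) for p in problem) / len(problem)
            # Solution words come from a mixture of the solution language
            # model and the translation model (fixed weight, an assumption).
            p_word = mix * solution_lm.get(s, floor) + (1 - mix) * p_trans
            score += math.log(max(p_word, floor))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

# Toy, made-up probabilities for illustration only.
words = ["printer", "jammed", "paper", "replaced", "roller"]
problem_lm = {"printer": 0.4, "jammed": 0.3, "paper": 0.3}
solution_lm = {"replaced": 0.5, "roller": 0.5}
translation = {("jammed", "replaced"): 0.6, ("paper", "roller"): 0.5}

k, score = best_split(words, problem_lm, solution_lm, translation)
print(words[:k], words[k:])
```

On this toy input the split falls after "paper", because moving the boundary in either direction forces a word to be explained by a model that assigns it only the floor probability. In a realistic setting the three models would be estimated from a corpus of similar documents rather than written by hand.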
Citation
DEEPAK, P., VISWESWARIAH, K., WIRATUNGA, N. and SANI, S. 2012. Two-part segmentation of text documents. In Proceedings of the 21st Association for Computing Machinery (ACM) International conference on information and knowledge management (CIKM'12), 29 October - 02 November 2012, Maui, USA. New York: ACM [online], pages 793-802. Available from: https://dx.doi.org/10.1145/2396761.2396862
Conference Name | 21st Association for Computing Machinery (ACM) International conference on information and knowledge management (CIKM'12) |
---|---|
Conference Location | Maui, USA |
Start Date | Oct 29, 2012 |
End Date | Nov 2, 2012 |
Acceptance Date | Oct 29, 2012 |
Online Publication Date | Oct 29, 2012 |
Publication Date | Oct 31, 2012 |
Deposit Date | Sep 21, 2016 |
Publicly Available Date | Sep 21, 2016 |
Publisher | Association for Computing Machinery (ACM) |
Pages | 793-802 |
DOI | https://doi.org/10.1145/2396761.2396862 |
Keywords | Text; Segmentation; Language models; Translation models |
Public URL | http://hdl.handle.net/10059/1830 |
Files
DEEPAK 2012 Two-part segmentation of text documents
(809 Kb)
PDF
Publisher Licence URL
https://creativecommons.org/licenses/by-nc-nd/4.0/