ASHISH UPADHYAY a.upadhyay@rgu.ac.uk
Completed Research Student
Dr Stewart Massie s.massie@rgu.ac.uk
Supervisor
Professor Nirmalie Wiratunga n.wiratunga@rgu.ac.uk
Supervisor
Data-to-Text Generation (D2T) is the subfield of Artificial Intelligence (AI) and Natural Language Processing (NLP) that aims to build systems capable of summarising non-linguistic structured data into textual reports. D2T systems extract important insights from domain-specific, mostly numerical, data and convey them in natural language reports that are more accessible to humans. This technology can help professionals reduce the time they spend on repetitive paperwork and focus on more important aspects of their jobs. The literature presents two main approaches to building D2T systems: first, the rule-based approach, which uses domain-specific rules with hand-crafted templates to produce the summaries; and second, the neural approach, which uses sequence-to-sequence learning to learn the domain rules and text generation from a parallel corpus of input data and output summaries. Rule-based systems produce accurate, high-quality summaries, but the summaries are monotonous and the systems are difficult to scale. Neural systems, on the other hand, promise scalability across problems and domains (given the availability of a parallel corpus for training) and are able to produce fluent, diverse and human-like texts. However, they struggle to maintain high accuracy and hallucinate content that is unsupported by the input data. Some D2T problems can be seen as a stream of time-stamped events with a textual summary written for each, for example daily weather forecasts or regular sporting events. Human-authored event summaries often contain temporal contextual information, where the presented information is derived from the data of previously occurring events, while in other situations the summary content is influenced by the environmental context in which the events take place.
Current state-of-the-art systems do not incorporate such contextual information when producing event summaries, making these summaries less interesting and less detailed than human-authored summaries. In this thesis, we present several key methods for a D2T system pipeline that addresses temporal and environmental contextual problems. We first present a dynamic template method for D2T, called CBR-D2T, which helps to mitigate the accuracy and diversity trade-off between neural and rule-based systems. This method uses the event representation to dynamically select a set of templates and organises them according to a pre-defined plan to generate an accurate yet diverse event summary. Empirical evaluations on a sports domain dataset suggest that CBR-D2T achieves 6% better content accuracy than a neural benchmark while also maintaining better fluency and diversity than a template baseline. We then present a content type typology of D2T problems that is used to profile D2T datasets based on the different levels of complexity present in the event summaries. We also present a CBR-based context-aware content planning method, CBR-Plan, which uses the environmental context of an event to produce a content plan for an event summary. The content plans produced by this method more closely imitate the content plans in human-authored summaries in terms of the different types of information discussed, achieving 10% higher correlation with the content plans in human-authored summaries than the content plans from summaries generated by neural benchmarks. Finally, we present a context-aware hybrid text generation method that uses important temporal contextual data selected from past related events to produce a contextually aware event summary. This method uses the content plan produced by CBR-Plan to build an input sequence with important data from the current as well as past events.
The input sequence is then used by a state-of-the-art neural network to generate an initial contextual summary, which is then post-edited with the CBR-D2T method to improve the accuracy of the past-event information communicated in the summary. A user study conducted to measure the accuracy of our proposed method found that post-editing the neural summary more than doubles its content accuracy.
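As a rough illustration of the dynamic template idea mentioned in the abstract (selecting a template based on the event representation), the following minimal Python sketch retrieves the most similar past case and reuses its template. It is not the thesis implementation: the case base, the feature names (`points`, `rebounds`, `assists`), the Euclidean distance and the templates are all hypothetical choices made for this example.

```python
import math

# Hypothetical case base: each case pairs an event representation
# (a small numeric feature dict) with a template taken from a past summary.
CASE_BASE = [
    ({"points": 30, "rebounds": 4, "assists": 3},
     "{name} led the scoring with {points} points."),
    ({"points": 12, "rebounds": 14, "assists": 2},
     "{name} dominated the boards with {rebounds} rebounds and added {points} points."),
    ({"points": 10, "rebounds": 3, "assists": 12},
     "{name} ran the offence, handing out {assists} assists."),
]

# Assumed feature set used to compare events.
FEATURES = ("points", "rebounds", "assists")


def distance(a, b):
    """Euclidean distance between two event representations."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in FEATURES))


def retrieve_template(event):
    """Return the template of the nearest case in the case base."""
    _, template = min(CASE_BASE, key=lambda case: distance(case[0], event))
    return template


def generate(event):
    """Fill the retrieved template with the values of the new event."""
    return retrieve_template(event).format(**event)


event = {"name": "Player A", "points": 11, "rebounds": 15, "assists": 1}
summary = generate(event)
print(summary)
```

Because every number in the output is copied from the input event, a sketch like this cannot state unsupported values, while the choice of wording still varies with the retrieved case.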
UPADHYAY, A. 2024. Context-aware data-to-text generation. Robert Gordon University, PhD thesis. Hosted on OpenAIR [online]. Available from: https://doi.org/10.48526/rgu-wt-2571408
Thesis Type | Thesis |
---|---|
Deposit Date | Nov 6, 2024 |
Publicly Available Date | Nov 6, 2024 |
DOI | https://doi.org/10.48526/rgu-wt-2571408 |
Keywords | Text generation; Data-to-text; Neural systems; Case-based reasoning (CBR); Natural language processing (NLP); Artificial intelligence (AI) |
Public URL | https://rgu-repository.worktribe.com/output/2571408 |
Award Date | May 31, 2024 |
UPADHYAY 2024 Context-aware data-to-text (PDF, 9.3 Mb)
Licence
https://creativecommons.org/licenses/by-nc/4.0/
Copyright Statement
© The Author.
WEC: weighted ensemble of text classifiers. (2020). Presentation / Conference Contribution
Case-based approach to automated natural language generation for obituaries. (2020). Presentation / Conference Contribution
GEMv2: multilingual NLG benchmarking in a single line of code. (2022). Presentation / Conference Contribution
A case-based approach to data-to-text generation. (2021). Presentation / Conference Contribution
A case-based approach for content planning in data-to-text generation. (2022). Presentation / Conference Contribution