Context-aware data-to-text generation.

Upadhyay, Ashish

doi:10.48526/rgu-wt-2571408

Abstract

Data-to-Text Generation (D2T) is the subfield of Artificial Intelligence (AI) and Natural Language Processing (NLP) that aims to build systems capable of summarising nonlinguistic structured data into textual reports. D2T systems extract important insights from domain-specific, mostly numerical, data and convey them in natural language reports that are more accessible to humans. This technology can help professionals reduce the time they spend on repetitive paperwork and allow them to focus on more important aspects of their jobs. The literature presents two main approaches to building D2T systems: first, the rule-based approach, which uses domain-specific rules with hand-crafted templates to produce the summaries; and second, the neural-based approach, which uses sequence-to-sequence learning to learn the domain rules and text generation from a parallel corpus of input data and output summaries. Rule-based systems produce accurate, high-quality summaries, but the summaries are monotonous and the systems are difficult to scale. Neural systems, on the other hand, promise scalability across problems and domains (given the availability of a parallel corpus for training) and are able to produce fluent, diverse and human-like texts. However, they struggle to maintain accuracy and hallucinate, generating content that is unsupported by the input data. Some D2T problems can be seen as a stream of time-stamped events with a textual summary written for each, for example daily weather forecasts or regular sporting events. Human-authored event summaries often contain temporal contextual information, where the presented information is derived from the data of previously occurring events, while in other situations the summary content is influenced by the environmental context in which the events take place.
Current state-of-the-art systems do not incorporate such contextual information when producing event summaries, which makes these summaries less interesting and less detailed than human-authored summaries. In this thesis, we present several key methods for a D2T system pipeline that addresses temporal and environmental contextual problems. We first present a dynamic template method for D2T, called CBR-D2T, which helps mitigate the accuracy and diversity trade-off between neural and rule-based systems. This method uses the event representation to dynamically select a set of templates and organises them based on a pre-defined plan to generate an accurate yet diverse event summary. Empirical evaluations on a sports domain dataset suggest that CBR-D2T achieves 6% better content accuracy than a neural benchmark while also maintaining better fluency and diversity than a template baseline. We then present a content type typology of D2T problems that is used to profile D2T datasets based on the different levels of complexity present in the event summaries. We also present a CBR-based context-aware content planning method, CBR-Plan, which uses the environmental context of an event to produce a content plan for an event summary. The content plans produced by this method more closely imitate the content plans in human-authored summaries in terms of the different types of information discussed, achieving 10% higher correlation with content plans in human-authored summaries than the content plans from summaries generated by neural benchmarks. Finally, we present a context-aware hybrid text generation method that uses important temporal contextual data selected from past related events to produce a contextually aware event summary. This method uses the content plan produced by CBR-Plan to build an input sequence with important data from the current event as well as past events.
The input sequence is then used by a state-of-the-art neural network to generate an initial contextual summary, which is then post-edited with the CBR-D2T method to improve the accuracy of the past-event information communicated in the summary. A user study conducted to measure the accuracy of our proposed method found that post-editing the neural summary more than doubles its content accuracy.

Thesis Type	Thesis
Deposit Date	Nov 6, 2024
Publicly Available Date	Nov 6, 2024
DOI	https://doi.org/10.48526/rgu-wt-2571408
Keywords	Text generation; Data-to-text; Neural systems; Case-based reasoning (CBR); Natural language processing (NLP); Artificial intelligence (AI)
Public URL	https://rgu-repository.worktribe.com/output/2571408
Award Date	May 31, 2024
