Parafusion-extended: large scale paraphrase dataset integrating lexico-phrasal knowledge.

Jayawardena, Lasal; Yapa, Prasan

Authors

Lasal Jayawardena

Prasan Yapa



Contributors

Kun Zhou
Editor

Abstract

Paraphrasing, the art of rephrasing text while retaining its original meaning, lies at the core of natural language understanding and generation. With rising demand for domain-specialized models, high-quality data is more valued than ever, and paraphrase data is no exception. ParaFusion-Extend (PFE) is a large-scale dataset, driven by Large Language Models, that incorporates lexical and phrasal knowledge. The dataset is curated to contain high-quality, diverse paraphrase pairs, along with separate knowledge bases that can be used for research and for data augmentation models. We show that PFE offers an increase of roughly 30% or more in syntactic and lexical diversity compared to the commonly used original data sources. We demonstrate the effectiveness of PFE on several downstream tasks, such as few-shot learning and training sentence embeddings. We utilize a gold-standard evaluation scheme, further strengthened by human evaluation, which shows the potential of PFE in advancing paraphrase generation.

Citation

JAYAWARDENA, L. and YAPA, P. 2024. Parafusion-extended: large scale paraphrase dataset integrating lexico-phrasal knowledge. In Zhou, K. (ed.) Computational and experimental simulations in engineering: proceedings of the 30th International conference on computational and experimental engineering and sciences 2024 (ICCES 2024), 3-6 August 2024, Singapore. Mechanisms and machine science, 173. Cham: Springer [online], volume 2, pages 258-281. Available from: https://doi.org/10.1007/978-3-031-77489-8_20

Presentation Conference Type: Conference Paper (published)
Conference Name: 30th International conference on computational and experimental engineering and sciences 2024 (ICCES 2024)
Start Date: Aug 3, 2024
End Date: Aug 6, 2024
Acceptance Date: Mar 15, 2024
Online Publication Date: Dec 3, 2024
Publication Date: Dec 3, 2024
Deposit Date: Apr 24, 2025
Publicly Available Date: Dec 4, 2025
Print ISSN: 2211-0984
Electronic ISSN: 2211-0992
Publisher: Springer
Peer Reviewed: Peer Reviewed
Volume: 2
Pages: 258-281
Series Title: Mechanisms and machine science
Series Number: 173
Series ISSN: 2211-0984
Book Title: Computational and experimental simulations in engineering
ISBN: 9783031774881; 9783031774911
DOI: https://doi.org/10.1007/978-3-031-77489-8_20
Keywords: Paraphrase generation; Natural language processing; Natural language generation; Knowledge representation; Large language models; Data augmentation; Sentence embeddings; Few-shot learning
Public URL: https://rgu-repository.worktribe.com/output/2801593

Files

This file is under embargo until Dec 4, 2025 due to copyright reasons.

Contact publications@rgu.ac.uk to request a copy for personal use.


