Parafusion-extended: large scale paraphrase dataset integrating lexico-phrasal knowledge.
Jayawardena, Lasal; Yapa, Prasan
Authors
Lasal Jayawardena
Prasan Yapa
Contributors
Kun Zhou (Editor)
Abstract
Paraphrasing, the art of rephrasing text while retaining its original meaning, lies at the core of natural language understanding and generation. With rising demand for more domain-specialized models, high-quality data is more valued than ever, and paraphrase data is no exception. ParaFusion-Extend (PFE) is a large-scale dataset driven by Large Language Models that incorporates lexical and phrasal knowledge. The dataset is curated to contain high-quality, diverse paraphrase pairs, along with separate knowledge bases that can be used for research and for data augmentation models. We show that PFE offers an increase of at least around 30% in syntactic and lexical diversity compared with the commonly used original data sources. We demonstrate the effectiveness of PFE on several downstream tasks, such as few-shot learning and training on sentence embeddings. We use a gold-standard evaluation scheme, further strengthened by human evaluation, which shows the potential of PFE in advancing paraphrase generation.
Citation
JAYAWARDENA, L. and YAPA, P. 2024. Parafusion-extended: large scale paraphrase dataset integrating lexico-phrasal knowledge. In Zhou, K. (ed.) Computational and experimental simulations in engineering: proceedings of the 30th International conference on computational and experimental engineering and sciences 2024 (ICCES 2024), 3-6 August 2024, Singapore. Mechanisms and machine science, 173. Cham: Springer [online], volume 2, pages 258-281. Available from: https://doi.org/10.1007/978-3-031-77489-8_20
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | 30th International conference on computational and experimental engineering and sciences 2024 (ICCES 2024) |
Start Date | Aug 3, 2024 |
End Date | Aug 6, 2024 |
Acceptance Date | Mar 15, 2024 |
Online Publication Date | Dec 3, 2024 |
Publication Date | Dec 3, 2024 |
Deposit Date | Apr 24, 2025 |
Publicly Available Date | Dec 4, 2025 |
Print ISSN | 2211-0984 |
Electronic ISSN | 2211-0992 |
Publisher | Springer |
Peer Reviewed | Peer Reviewed |
Volume | 2 |
Pages | 258-281 |
Series Title | Mechanisms and machine science |
Series Number | 173 |
Series ISSN | 2211-0984 |
Book Title | Computational and experimental simulations in engineering |
ISBN | 9783031774881; 9783031774911 |
DOI | https://doi.org/10.1007/978-3-031-77489-8_20 |
Keywords | Paraphrase generation; Natural language processing; Natural language generation; Knowledge representation; Large language models; Data augmentation; Sentence embeddings; Few-shot learning |
Public URL | https://rgu-repository.worktribe.com/output/2801593 |
Files
This file is under embargo until Dec 4, 2025 due to copyright reasons.
Contact publications@rgu.ac.uk to request a copy for personal use.