Labelled Vulnerability Dataset on Android source code (LVDAndro) to develop AI-based code vulnerability detection models.

Senanayake, Janaka; Kalutarage, Harsha; Al-Kadri, Mhd Omar; Piras, Luca; Petrovski, Andrei

doi:10.5220/0012060400003555

Authors

Dr Janaka Senanayake j.senanayake1@rgu.ac.uk
Lecturer

Dr Harsha Kalutarage h.kalutarage@rgu.ac.uk
Associate Professor

Mhd Omar Al-Kadri

Luca Piras

Andrei Petrovski

Contributors

Sabrina De Capitani di Vimercati
Editor

Pierangela Samarati
Editor

Abstract

Ensuring the security of Android applications is a vital and intricate task that requires careful consideration during development. Unfortunately, many apps are published without sufficient security measures, possibly because vulnerabilities are not identified early. One possible solution is to employ machine learning models trained on a labelled dataset, but currently available datasets are suboptimal. This study creates a sequence of datasets of Android source code vulnerabilities, named LVDAndro, labelled based on the Common Weakness Enumeration (CWE). Three datasets were generated through app scanning by varying the number of apps and their sources. LVDAndro includes over 2,000,000 unique code samples, obtained by scanning over 15,000 apps. As a proof of concept, the AutoML technique was then applied to each dataset to evaluate the applicability of LVDAndro in detecting vulnerable source code using machine learning. The AutoML model trained on the dataset achieved an accuracy of 94% and an F1-Score of 0.94 in binary classification, and an accuracy of 94% and an F1-Score of 0.93 in CWE-based multi-class classification. The LVDAndro dataset is publicly available and continues to expand as more apps are scanned and added regularly. The LVDAndro GitHub repository also includes the source code for dataset generation and model training.
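
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how accuracy and F1-Score are conventionally computed from true and predicted labels. It is not taken from the LVDAndro repository: the function names, the toy labels, and the choice of macro-averaging for the multi-class case are all assumptions made for illustration, since the abstract does not state which averaging the authors used.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1-Score treating `label` as the positive class (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (assumed averaging, see lead-in)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical toy labels, not LVDAndro data: 1 = vulnerable, 0 = not vulnerable.
binary_true = [1, 1, 1, 0, 0, 0, 1, 0]
binary_pred = [1, 1, 0, 0, 0, 1, 1, 0]
binary_accuracy = accuracy(binary_true, binary_pred)
binary_f1 = f1_for_label(binary_true, binary_pred, label=1)

# Hypothetical multi-class labels; the CWE identifiers are illustrative only.
multi_true = ["CWE-200", "CWE-200", "CWE-312", "CWE-312"]
multi_pred = ["CWE-200", "CWE-312", "CWE-312", "CWE-312"]
multi_macro_f1 = macro_f1(multi_true, multi_pred)

print(binary_accuracy, binary_f1, multi_macro_f1)
```

Accuracy alone can look high on an imbalanced dataset, which is why the F1-Score, the harmonic mean of precision and recall, is typically reported alongside it.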

Citation

SENANAYAKE, J., KALUTARAGE, H., AL-KADRI, M.O., PIRAS, L. and PETROVSKI, A. 2023. Labelled Vulnerability Dataset on Android source code (LVDAndro) to develop AI-based code vulnerability detection models. In De Capitani di Vimercati, S. and Samarati, P. (eds.) Proceedings of the 20th International conference on security and cryptography, 10-12 July 2023, Rome, Italy, volume 1. Setúbal: SciTePress [online], pages 659-666. Available from: https://doi.org/10.5220/0012060400003555

Presentation Conference Type	Conference Paper (published)
Conference Name	20th International Conference on Security and Cryptography 2023 (SECRYPT 2023)
Start Date	Jul 10, 2023
End Date	Jul 12, 2023
Acceptance Date	Apr 21, 2023
Online Publication Date	Jul 12, 2023
Publication Date	Dec 31, 2023
Deposit Date	Sep 7, 2023
Publicly Available Date	Sep 7, 2023
Publisher	SciTePress
Peer Reviewed	Peer Reviewed
Volume	1
Pages	659-666
Series ISSN	2184-7711
Book Title	Proceedings of the 20th International Conference on Security and Cryptography
ISBN	9789897586668
DOI	https://doi.org/10.5220/0012060400003555
Keywords	Android application security; Code vulnerability; Labelled dataset; Artificial intelligence; Auto machine learning
Public URL	https://rgu-repository.worktribe.com/output/2072016
Related Public URLs	https://rgu-repository.worktribe.com/output/2072071 (Related dataset link-only output)
Additional Information	Publisher preferred citation: Senanayake, J.; Kalutarage, H.; Al-Kadri, M.; Piras, L. and Petrovski, A. (2023). Labelled Vulnerability Dataset on Android Source Code (LVDAndro) to Develop AI-Based Code Vulnerability Detection Models. In Proceedings of the 20th International Conference on Security and Cryptography - SECRYPT; ISBN 978-989-758-666-8; ISSN 2184-7711, SciTePress, pages 659-666. DOI: 10.5220/0012060400003555