Generating realistic vulnerabilities via neural code editing: an empirical study

  • Yu Nong
  • Yuzhe Ou
  • Michael Pradel
  • Feng Chen
  • Haipeng Cai
  • Washington State University Pullman
  • University of Texas at Dallas
  • University of Stuttgart

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

23 Scopus citations

Abstract

The availability of large-scale, realistic vulnerability datasets is essential both for benchmarking existing techniques and for developing effective new data-driven approaches for software security. Yet such datasets are critically lacking. A promising solution is to generate such datasets by injecting vulnerabilities into real-world programs, which are richly available. Thus, in this paper, we explore the feasibility of vulnerability injection through neural code editing. With a synthetic dataset and a real-world one, we investigate the potential and gaps of three state-of-the-art neural code editors for vulnerability injection. We find that the studied editors have critical limitations on the real-world dataset, where the best accuracy is only 10.03%, versus 79.40% on the synthetic dataset. While the graph-based editors are more effective (successfully injecting vulnerabilities in up to 34.93% of real-world testing samples) than the sequence-based one (0 success), they still suffer from complex code structures and fall short for long edits due to their insufficient designs of the preprocessing and deep learning (DL) models. We reveal the promise of neural code editing for generating realistic vulnerable samples, as they help boost the effectiveness of DL-based vulnerability detectors by up to 49.51% in terms of F1 score. We also provide insights into the gaps in current editors (e.g., they are good at deleting but not at replacing code) and actionable suggestions for addressing them (e.g., designing effective editing primitives).

Original language: English
Title of host publication: ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Editors: Abhik Roychoudhury, Cristian Cadar, Miryung Kim
Publisher: Association for Computing Machinery, Inc
Pages: 1097-1109
Number of pages: 13
ISBN (Electronic): 9781450394130
State: Published - Nov 7 2022
Event: 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022 - Singapore, Singapore
Duration: Nov 14 2022 to Nov 18 2022

Publication series

Name: ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Conference

Conference: 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022
Country/Territory: Singapore
City: Singapore
Period: 11/14/22 to 11/18/22

Keywords

  • benchmarking
  • data augmentation
  • data generation
  • datasets
  • deep learning
  • software vulnerability
  • vulnerability detection
