An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions
by Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri (2021)
Abstract
There is burgeoning interest in designing AI-based systems to assist humans
in designing computing systems, including tools that automatically generate
computer code. The most notable of these comes in the form of the first
self-described 'AI pair programmer', GitHub Copilot, a language model trained
over open-source GitHub code. However, code often contains bugs -- and so,
given the vast quantity of unvetted code that Copilot has processed, it is
certain that the language model will have learned from exploitable, buggy
code. This raises concerns about the security of Copilot's code contributions.
In this work, we systematically investigate the prevalence and conditions that
can cause GitHub Copilot to recommend insecure code. To perform this analysis,
we prompt Copilot to generate code in scenarios relevant to high-risk CWEs
(e.g., those from MITRE's "Top 25" list). We explore Copilot's performance on
three distinct code generation axes -- examining how it performs given
diversity of weaknesses, diversity of prompts, and diversity of domains. In
total, we produce 89 different scenarios for Copilot to complete, yielding
1,692 programs. Of these, we found approximately 40% to be vulnerable.
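
For illustration, here is a minimal sketch of the kind of prompt scenario the
abstract describes. It is hypothetical, not drawn from the paper's scenario
set: a comment and function signature relevant to CWE-89 (SQL injection) serve
as the prompt, and a code generator would be asked to complete the body. An
insecure completion interpolates user input directly into the query string; a
secure one uses a parameterized query.

    import sqlite3

    # Prompt: "return the row for `username` from the users table"
    def get_user(db_path, username):
        conn = sqlite3.connect(db_path)
        cur = conn.cursor()
        # Insecure completion (CWE-89): user input concatenated into SQL.
        #   cur.execute("SELECT * FROM users WHERE name = '%s'" % username)
        # Secure completion: a parameterized query keeps data out of the SQL text.
        cur.execute("SELECT * FROM users WHERE name = ?", (username,))
        row = cur.fetchone()
        conn.close()
        return row

Whether a generator produces the insecure or the secure variant of such a
completion is exactly the kind of outcome the study's scenarios measure.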
Archived Files and Locations
application/pdf 613.4 kB
arxiv.org (repository); web.archive.org (webarchive)
arXiv:2108.09293v2