AI code assistants may not produce as many errors as feared • The Register

Machine learning models that support next-generation code completion tools like GitHub Copilot can help software developers write more functional code without making it less secure.

That’s the preliminary finding of a small user study of 58 people conducted by a group of computer scientists from New York University.

In a paper distributed via ArXiv, Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg describe how they put the security of source code produced with the help of large language models (LLMs) to the test.

LLMs like OpenAI’s GPT family have been trained on huge amounts of public text data, or public source code in the case of OpenAI’s Codex, a GPT descendant and the foundation of GitHub’s Copilot. As such, they can reproduce mistakes made by human programmers in the past, exemplifying the “garbage in, garbage out” maxim. There was a fear that these tools would regurgitate and suggest bad code to developers, who would incorporate the stuff into their projects.

What’s more, code security can be contextual: code that is secure in isolation may be insecure when run in a particular sequence with other software. These auto-completion tools may offer suggestions that are fine on their own but that, when combined with other code, become vulnerable to attack or just plain broken. It turns out, however, that these tools may not actually make people any worse at programming.


In a sense, the researchers were dousing a fire they themselves had helped to start. About a year ago, two of the same computer scientists contributed to a paper titled “Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions.” That work found that approximately 40 percent of Copilot’s output contained potentially exploitable weaknesses (CWEs).

“The difference between the two papers is that ‘Asleep at the Keyboard’ looked at fully automated code generation (no human in the loop), and we didn’t have human users to compare against, so we couldn’t say anything about how the security of Copilot-generated code compares with the security of human-written code,” Brendan Dolan-Gavitt, co-author of both papers and an assistant professor in the computer science and engineering department at NYU Tandon, told The Register in an email.

“The user study paper attempts to address these missing pieces head-on, with half the users getting assistance from Codex (the model that powers Copilot) and the other half writing the code themselves. However, it’s also narrower than ‘Asleep at the Keyboard’: we only looked at one task and one language (writing a linked list in C).”

In the latest paper, “Security Implications of Large Language Model Code Assistants: A User Study,” a somewhat different group of NYU researchers acknowledges that previous work did not realistically model the use of LLM-based tools like Copilot.

“First of all, these studies assume that all code is automatically generated by the LLM (we call this the autopilot mode),” the researchers explain in their paper.

“In practice, code-completion LLMs assist developers with suggestions that they can accept, edit, or reject. This means that while automation-biased programmers might naively accept buggy completions, other developers might produce less buggy code by using the time saved to fix bugs.”

Second, they note that while LLMs can produce buggy code, so can humans. The bugs in the LLM training data came from humans in the first place.

Rather than solely assessing the bugginess of LLM-generated code, they set out to compare how code produced by human developers with the help of machine learning models differs from code those developers write entirely on their own.

The NYU computer scientists recruited 58 study participants — undergraduate and graduate students in software development courses — and divided them into a control group that would work without suggestions and an assisted group that had access to a custom suggestion system built using OpenAI’s Codex API. They also used the Codex model to create 30 solutions to the given programming problems as a point of comparison. This autopilot group acted primarily as a second control group.

Both the assisted and control groups were allowed to consult web resources such as Google and Stack Overflow, but were not allowed to ask other people for help. The work was done in Visual Studio Code inside a web-based container built with the open source Anubis.

Participants were asked to complete a shopping list program in the C programming language because “developers easily accidentally express vulnerable design patterns in C” and because the C compiler toolchain used does not check for errors to the same extent that toolchains for modern languages, such as Go and Rust, do.

When the researchers manually analyzed the code produced by the control and assisted groups, they found that, contrary to the previous work, the AI code suggestions didn’t make things worse overall.

Looks clear, but there are details

“[W]e found no evidence that Codex assistance increases the incidence of security bugs,” the paper states, while noting that the study’s small sample size means further study is warranted. “On the contrary, there is some evidence to suggest that CWEs/LoC [CWEs per lines of code] decrease with Codex assistance.”

“It’s hard to conclude this with much statistical certainty,” Siddharth Garg, a cybersecurity researcher and associate professor in the engineering department at NYU Tandon, said in a phone interview with The Register.

Nonetheless, he said, “The data suggests that Copilot users weren’t faring much worse.”

Dolan-Gavitt is similarly cautious about the results.

“Current analysis of the results of our user study did not reveal any statistically significant differences – we are still analyzing this, including qualitatively, so I would not draw any strong conclusions from this, particularly as it was a small study (58 users in total) and the users were students rather than professional developers,” he said.

“Nevertheless, we can say that for these users on this task, the impact of AI assistance on security was probably not large: if it had had a very large effect, we would have observed a larger difference between the two groups. We’re doing a bit more statistical analysis to establish that right now.”

Some other findings emerged as well. One is that participants in the assisted group were more productive: they generated more lines of code and completed a larger proportion of the functions in the task.

“Users in the assisted group passed more functional tests and produced more functional code,” Garg said, adding that results of this kind could help companies weighing assistive coding tools decide whether to deploy them.

Another is that the researchers were able to distinguish the output produced by the control, assisted, and autopilot groups, which could allay concerns about AI-powered cheating in educational settings.

The researchers also noted that AI tools need to be considered in the context of user error. “Users provide prompts that may include bugs, accept buggy completions that end up in the ‘completed’ programs, and accept bugs that are later removed,” the paper reads. “In some cases, users also end up with more bugs than the model suggested!”

Expect more work in this direction. ®

Rick Schindler
