GitHub Copilot, one of several recent tools for generating programming code suggestions with the help of AI models, remains problematic for some users due to licensing concerns and the telemetry the software sends back to the Microsoft-owned company.
That is why Brendan Dolan-Gavitt, an assistant professor in the Department of Computer Science and Engineering at NYU Tandon, has released FauxPilot, an alternative to Copilot that runs locally without phoning home to the Microsoft mothership.
Copilot relies on OpenAI Codex, a natural language-to-code system based on GPT-3 that was trained on “billions of lines of public code” in GitHub repositories. This has made Free and Open Source Software (FOSS) advocates uneasy because Microsoft and GitHub have failed to identify exactly which repositories informed Codex.
As Bradley Kuhn, Policy Fellow at the Software Freedom Conservancy (SFC), wrote in a blog post earlier this year, “Copilot left copyleft compliance as an exercise for users. Users are likely to face growing liability, which will only increase as Copilot improves.”
Shortly after GitHub Copilot became commercially available, the SFC urged open source maintainers not to use GitHub, in part because of its refusal to address concerns about Copilot.
Not a perfect world
FauxPilot does not use Codex. It relies on Salesforce’s CodeGen model. This is unlikely to appease FOSS proponents, however, as CodeGen was also trained on public open-source code with little regard for the nuances of different licenses.
“The models currently in use were trained by Salesforce, and they were, again, trained on essentially all of GitHub’s public code,” Dolan-Gavitt explained in a phone interview with The Register. “So there are still some issues, possibly with licensing, that this wouldn’t solve.”
“On the other hand, if someone with enough computing power came along and said, ‘I’m going to train a model that’s only trained on GPL code or code that has a license that allows me to reuse it without attribution’ or something like that, they could train their model, drop that model into FauxPilot and use that model instead.”
For Dolan-Gavitt, the main goal of FauxPilot is to provide a way to run the AI assistant software on-premises.
“There are people who have privacy concerns, or maybe, in the case of work, some company policy that prevents them from sending their code to third parties, and that’s definitely helped by being able to run it locally,” he explained.
GitHub, in its description of what data Copilot collects, describes an option to disable the collection of code snippets data, which includes “source code you are editing, associated files and other files open in the same IDE or editor, URLs of repositories and file paths”.
But doing so doesn’t seem to disable the collection of user engagement data – “user edit actions like accepted and discarded completions, as well as error and general usage data to identify metrics like latency and feature engagement” and possibly “personal data like pseudonymous identifiers”.
Dolan-Gavitt said he sees FauxPilot as a research platform.
“One thing we want to do is train code models that will hopefully emit safer code,” he explained. “And once we’ve done that, we want to be able to test them and maybe even test them with actual users using something like Copilot but with our own models. So that was kind of a motivation.”
However, this comes with some challenges. “Right now, it’s a bit impractical to try to create a dataset that doesn’t have security vulnerabilities because the models are really data-hungry,” Dolan-Gavitt said.
“So they want lots of code to train with. But we don’t have very good or foolproof ways to ensure code is bug-free. So it would be an immense amount of work to try to curate a dataset that was vulnerability-free.”
Nonetheless, Dolan-Gavitt, who co-authored a paper on the insecurity of Copilot’s code suggestions, finds AI assistance useful enough to stick with it.
“My personal feeling about it is that I’ve basically had Copilot turned on since it came out last summer,” he explained. “I find it really useful. However, I need to double-check its work. But often it’s easier for me to at least start with something it gives me and then fix it than to try to create it from scratch.” ®
https://www.theregister.com/2022/08/06/fauxpilot_github_copilot/ Like GitHub Copilot without Microsoft Telemetry • The Register