DOE v. Github (original complaint) Court Filing, retrieved on November 3, 2022, is part of HackerNoon’s Legal PDF Series. This is part 18 of 37.
VII. FACTUAL ALLEGATIONS
E. Copilot Was Launched Despite Its Propensity for Producing Unlawful Outputs
82. GitHub and OpenAI have not provided much detail regarding what data Codex and Copilot were trained on. Plaintiffs know for certain from GitHub’s and OpenAI’s statements that both systems were trained on publicly available GitHub repositories, with Copilot having been trained on all available public GitHub repositories. Thus, if Licensed Materials have been posted to a GitHub public repository, Plaintiffs and the Class can be reasonably certain that they were ingested by Copilot and are sometimes returned to users as Output.
83. According to OpenAI, Codex was trained on “billions of lines of source code from publicly available sources, including code in public GitHub repositories”. Similarly, GitHub has described[13] Copilot’s training material as “billions of lines of public code.” GitHub researcher Eddie Aftandilian confirmed in a recent podcast[14] that Copilot is “train[ed] on public repos on GitHub.”
84. In a recent customer-support message, GitHub’s support department clarified certain facts about training Copilot. First, GitHub said that “training for Codex (the model used by Copilot) is done by OpenAI, not GitHub.” Second, in its support message, GitHub put forward a more detailed justification for its use of copyrighted code as training data:
Training machine learning models on publicly available data is
considered fair use across the machine learning community . . .
OpenAI’s training of Codex is done in accordance with global
copyright laws which permit the use of publicly accessible materials
for computational analysis and training of machine learning
models, and do not require consent of the owner of such materials.
Such laws are intended to benefit society by enabling machines to
learn and understand using copyrighted works, much as humans
have done throughout history, and to ensure public benefit, these
rights cannot generally be restricted by owners who have chosen to
make their materials publicly accessible.
The claim that training ML models on publicly available code is widely accepted as fair use is not true. And regardless of this concept’s level of acceptance in “the machine learning community,” under Federal law, it is illegal.
85. Former GitHub CEO Nat Friedman said in June 2021, when Copilot was released to a limited number of customers, that “training ML systems on public data is fair use.”[15] Friedman’s statement is pure speculation; no Court has considered the question of whether “training ML systems on public data is fair use.” The Fair Use affirmative defense is only applicable to Section 501 copyright infringement. It is not a defense to violations of the DMCA, Breach of Contract, or any other claim alleged herein. It cannot be used to avoid liability here. At the same time, Friedman asserted that “the output [of Copilot] belongs to the operator.”
86. Other open-source stakeholders have made this point already. For example, in June 2021, Software Freedom Conservancy (“SFC”), a prominent open-source advocacy organization, asked Microsoft and GitHub to provide “legal references for GitHub’s public legal positions.” No references were provided by any of the Defendants.[16]
87. Beyond the examples above, Copilot regularly Outputs verbatim copies of Licensed Materials. For example, Copilot reproduced verbatim well-known code from the game Quake III, use of which is governed by one of the Suggested Licenses, GPL-2.[17]
88. Copilot also reproduced code that had been released under a license that allowed its use only for free games and required attribution by including a copy of the license. Copilot neither mentioned nor included the underlying license when providing a copy of this code as Output.[18]
89. Texas A&M computer-science professor Tim Davis has provided numerous examples of Copilot reproducing code belonging to him without its license or attribution.[19]
90. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times.
91. In June 2022, Copilot had 1,200,000 users. If only 1% of users have ever received Output based on Licensed Materials, and only once each, Defendants have “only” breached Plaintiffs’ and the Class’s Licenses 12,000 times. However, each time Copilot outputs Licensed Materials without attribution, the copyright notice, or the License Terms, it violates the DMCA three times. Thus, even using this extreme underestimate, Copilot has “only” violated the DMCA 36,000 times.[20] Because Copilot constantly Outputs code as a user writes, and because nearly all of Copilot’s training data was Licensed Material, this number is most likely exponentially lower than the true number of breaches and DMCA violations.
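Editorial illustration (not part of the filing): the arithmetic in paragraph 91 and footnote 20 can be reproduced with a short Python sketch. The inputs are the figures alleged in the complaint; the function and variable names are hypothetical and chosen only for this example, and the result is a restatement of the complaint's own estimate, not an independent calculation of damages.

```python
# Figures as alleged in paragraph 91 and footnote 20 of the complaint.
USERS = 1_200_000             # Copilot users in June 2022
SHARE_PERCENT = 1             # assumed share of users who received such Output once
VIOLATIONS_PER_OUTPUT = 3     # missing attribution, copyright notice, License Terms
MIN_STATUTORY_DAMAGES = 2_500   # USD per violation, 17 U.S.C. 1203(c)(3)(B)
MAX_STATUTORY_DAMAGES = 25_000  # USD per violation


def estimate(users, share_percent, violations_per_output, min_damages, max_damages):
    """Return (license breaches, DMCA violations, low damages, high damages)."""
    breaches = users * share_percent // 100          # one Output per affected user
    violations = breaches * violations_per_output    # three violations per Output
    return (
        breaches,
        violations,
        violations * min_damages,
        violations * max_damages,
    )


breaches, violations, low, high = estimate(
    USERS,
    SHARE_PERCENT,
    VIOLATIONS_PER_OUTPUT,
    MIN_STATUTORY_DAMAGES,
    MAX_STATUTORY_DAMAGES,
)

print(f"License breaches: {breaches:,}")
print(f"DMCA violations:  {violations:,}")
print(f"Statutory damages range: ${low:,} to ${high:,}")
```

With the complaint's inputs, this yields 12,000 breaches, 36,000 violations, and a range of $90 million to $900 million, matching the figures stated in the filing. Changing the assumed share or the number of Outputs per user scales every figure proportionally.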
[13] https://github.blog/2021-06-30-github-copilot-research-recitation/.
[14] https://www.se-radio.net/2022/10/episode-533-eddie-aftandilian-on-github-copilot/.
[15] https://twitter.com/natfriedman/status/1409914420579344385/.
[16] https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/.
[17] https://twitter.com/stefankarpinski/status/1410971061181681674/.
[18] https://twitter.com/ChrisGr93091552/status/1539731632931803137/.
[19] https://twitter.com/DocSparse/status/1581461734665367554/.
[20] These violations of Section 1202 of the DMCA each incur statutory damages of “not less than $2,500 or more than $25,000.” 17 U.S.C. § 1203(c)(3)(B). This extremely conservative estimate of Defendants’ number of direct violations translates to $90 million to $900 million in statutory damages.
About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.
This court case 3:22-cv-06823-KAW retrieved on September 5, 2023, from Storage.Courtlistener is part of the public domain. The court-created documents are works of the federal government, and under copyright law, are automatically placed in the public domain and may be shared without legal restriction.