AI-Generated Code and Copyright Infringement: Codex’s Attribution Problem

2 Sept 2023

DOE vs. GitHub (amended complaint) Court Filing (Redacted), June 8, 2023, is part of HackerNoon’s Legal PDF Series. This is part 14 of 38.

VI. CLASS ALLEGATIONS

B. Codex Outputs Copyrighted Materials Without Following the Terms of the Applicable Licenses

52. Below is an explanation of how Codex functions. When Codex is prompted with “function isEven(n) {”, it assumes this is the beginning of a function written in the JavaScript language that will test whether a number is even.

53. Based on this assumption, Codex will then provide Output meant to complete the rest of the function. Based on the given prompt, it produced the following response:[7]

function isEven(n) {

54. The function itself occupies the first ten lines. Six additional lines follow the function, beginning with “console.log(isEven(50))”. One possible explanation for Codex’s inclusion of these lines is to test the “isEven” function. Though not part of the function itself, the lines will confirm the function works for certain values. In this case, the code implies that “isEven(50)” should return the value “true”, and “isEven(75)” should return “false”. Those answers are correct.

55. The penultimate line indicates “isEven(-1)” should return “??”. This is an error, as “isEven(-1)” should return “false”.

56. Codex cannot and does not understand the meaning of software code or any other Licensed Materials. But in training, what became Codex was exposed to an enormous amount of existing software code (its “Training Data”) and—with input from its trainers and its own internal processes—inferred certain statistical patterns governing the structure of code and other Licensed Materials. The finished version of Codex, once trained, is known as a “Model.”

57. When given a prompt, such as the initial prompt discussed above—“function isEven(n) {”—Codex identifies the most statistically likely completion, based on the examples it reviewed in training. Every instance of Output from Codex is derived from material in its Training Data. Most of its Training Data consisted of Licensed Materials.
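To make the phrase “most statistically likely completion” concrete, here is a deliberately tiny sketch. It is not how Codex is implemented; the filing describes Codex only as a trained Model, and everything below (the three-line corpus, the `train` and `mostLikelyNext` names, the whitespace tokenization) is a hypothetical illustration of frequency-based prediction.

```javascript
// Toy corpus: made-up snippets standing in for "Training Data" (assumption).
const corpus = [
  "return n % 2 === 0 ;",
  "return n % 2 === 0 ;",
  "return n % 2 === 1 ;",
];

// Count how often each token is followed by each other token.
function train(lines) {
  const counts = new Map();
  for (const line of lines) {
    const tokens = line.split(" ");
    for (let i = 0; i + 1 < tokens.length; i++) {
      const followers = counts.get(tokens[i]) ?? new Map();
      followers.set(tokens[i + 1], (followers.get(tokens[i + 1]) ?? 0) + 1);
      counts.set(tokens[i], followers);
    }
  }
  return counts;
}

// Return the token seen most often after `token`, or null if it was never seen.
function mostLikelyNext(counts, token) {
  const followers = counts.get(token);
  if (!followers) return null;
  let best = null;
  let bestCount = 0;
  for (const [next, count] of followers) {
    if (count > bestCount) {
      best = next;
      bestCount = count;
    }
  }
  return best;
}

const model = train(corpus);
console.log(mostLikelyNext(model, "==="));
// → 0
```

The toy model picks “0” after “===” only because that pairing appears most often in its corpus, not because it knows what evenness means.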

58. Codex does not “write” code the way a human would, because it does not understand the meaning of code. Codex’s lack of understanding of code is evidenced when it emits extra code that is not relevant under the circumstances. Here, Codex was only prompted to produce a function called “isEven”. To produce its answer, Codex relied on Training Data that also appended the extra testing lines. Having encountered this function and the follow-up lines together frequently, Codex extrapolates that they are all part of one function. A human with even a basic understanding of how JavaScript works would know the extra lines are not part of the function itself.

59. Beyond the superfluous and inaccurate extra lines, this “isEven” function also contains two major defects. First, it assumes the variable “n” holds an integer. It could contain some other kind of value, like a decimal number or text string, which would cause an error. Second, even if “n” does hold an integer, the function will trigger a memory error called a “stack overflow” for sufficiently large integers. For these reasons, experienced programmers would not use Codex’s Output.
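The two defects described in paragraph 59 can be reproduced with the recursive function quoted in paragraph 63 of this filing. The sketch below assumes a runtime without tail-call elimination, such as Node.js; the helper `errorNameOf` and the alternative `isEvenChecked` are hypothetical names introduced only for illustration and do not appear in the filing or in Eloquent JavaScript.

```javascript
// The recursive function as quoted in paragraph 63.
function isEven(n) {
  if (n == 0) return true;
  else if (n == 1) return false;
  else if (n < 0) return isEven(-n);
  else return isEven(n - 2);
}

// Hypothetical helper: run fn and return the name of any error it throws.
function errorNameOf(fn) {
  try {
    fn();
    return null;
  } catch (err) {
    return err.name;
  }
}

// Defect 1: a non-integer never reaches 0 or 1, so the recursion never ends.
// 2.5 -> 0.5 -> -1.5 -> 1.5 -> -0.5 -> 0.5 -> ... until the stack is exhausted.
console.log(errorNameOf(() => isEven(2.5))); // "RangeError" on V8-based runtimes

// Defect 2: a large integer needs one stack frame per subtraction of 2.
console.log(errorNameOf(() => isEven(1e7))); // "RangeError" on V8-based runtimes

// One possible alternative (an assumption, not the book's solution):
// reject non-integers explicitly and use the remainder operator.
function isEvenChecked(n) {
  if (!Number.isInteger(n)) {
    throw new TypeError("isEvenChecked expects an integer");
  }
  return n % 2 === 0;
}

console.log(isEvenChecked(1e7)); // true
console.log(isEvenChecked(-1)); // false
```

The alternative trades the book's teaching goal (practising recursion) for constant stack use, so it is a different design choice rather than a correction of the exercise.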

60. Codex does not identify the owner of the copyright to this Output, or to any other Output; it has not been trained to provide Attribution. Nor does it include a Copyright Notice or any License Terms attached to the Output. This is by design: Codex was not coded or trained to track or reproduce such data. The Output in the example above is taken from Eloquent JavaScript by Marijn Haverbeke.[8]

61. Here is the exercise from Eloquent JavaScript:

// Your code here.

console.log(isEven(50));
// → true
console.log(isEven(75));
// → false
console.log(isEven(-1));
// → ??

62. The exercise includes the “??” error. However, for Haverbeke’s purposes, this is not an error but a placeholder value for the reader to fill in. Codex—as a mere probabilistic model—fails to recognize this nuance. The inclusion of the double question marks confirms unequivocally that Codex took this code directly from a copyrighted source without following any of the attendant License Terms.

63. Haverbeke provides the following solution to the function discussed above:

function isEven(n) {
  if (n == 0) return true;
  else if (n == 1) return false;
  else if (n < 0) return isEven(-n);
  else return isEven(n - 2);
}

console.log(isEven(50));
// → true
console.log(isEven(75));
// → false
console.log(isEven(-1));
// → false

64. Aside from different line breaks, which are not semantically meaningful in JavaScript, this code for the function “isEven” is the same as what Codex produced. The tests are also the same, though in this case Haverbeke provides the right answer for “isEven(-1)”, which is “false”. Codex has reproduced Haverbeke’s Licensed Material almost verbatim, with the only difference being drawn from a different portion of those same Licensed Materials.
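A rough way to check a claim like the one in paragraph 64 is to collapse all whitespace before comparing two pieces of source text. The sketch below is an assumption-laden illustration: the `compact` string is a hypothetical one-line layout, not Codex's actual Output, and `collapseWhitespace` is a made-up helper. Collapsing whitespace is also only a crude test, since line breaks can matter elsewhere in JavaScript (for example, with automatic semicolon insertion or inside string literals), though not in this function.

```javascript
// Hypothetical one-line layout of the function (not Codex's actual Output).
const compact =
  "function isEven(n) { if (n == 0) return true; else if (n == 1) return false; " +
  "else if (n < 0) return isEven(-n); else return isEven(n - 2); }";

// The layout quoted in paragraph 63.
const multiline = `function isEven(n) {
  if (n == 0) return true;
  else if (n == 1) return false;
  else if (n < 0) return isEven(-n);
  else return isEven(n - 2);
}`;

// Replace every run of spaces, tabs, and line breaks with a single space.
function collapseWhitespace(source) {
  return source.replace(/\s+/g, " ").trim();
}

const sameText = collapseWhitespace(compact) === collapseWhitespace(multiline);
console.log(sameText);
// → true
```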

65. There are many copies of Haverbeke’s code stored in public repositories on GitHub, where programmers who are working through Haverbeke’s book store their answers.

66. The MIT license provides that “The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.”[9] Any person taking this code directly from Eloquent JavaScript would have direct access to these License Terms and know to follow them if incorporating the Licensed Materials into a derivative work and/or copying them. Codex does not provide these License Terms.

67. OpenAI Codex’s Output would frequently, perhaps even constantly, contain Licensed Materials, i.e., it would have conditions associated with it through its associated license. In its 2021 research paper about Codex called “Evaluating Large Language Models Trained on Code,” OpenAI stated Codex’s Output is “often incorrect” and can contain security vulnerabilities and other “misalignments” (meaning, departures from what the user requested).

68. Most open-source licenses require attribution of the author, notice of their copyright, and a copy of the license specifically to ensure that future coders can easily credit all previous authors and ensure they adhere to all applicable licenses. All the Suggested Licenses include these requirements.

69. Ultimately, Codex derives its value primarily from its ability to locate and output potentially useful Licensed Materials, and from its obfuscation of any rights associated with those materials.


[7] Due to the nature of Codex, Copilot, and AI in general, Plaintiffs cannot be certain these examples would produce the same results if attempted following additional trainings of Codex and/or Copilot. However, these examples are representative of Codex and Copilot’s Output at the time just prior to the filing of this Complaint.

[8] https://eloquentjavascript.net/code/#3.2. Eloquent JavaScript is “Licensed under a Creative Commons [A]ttribution-[N]oncommercial license. All code in this book may also be considered licensed under an MIT license.” See https://eloquentjavascript.net/. Thus, having also been posted on GitHub, the code Codex relied on meets the definition of Licensed Materials.



About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case 4:22-cv-06823-JST retrieved on August 26, 2023, from Storage Courtlistener is part of the public domain. The court-created documents are works of the federal government, and under copyright law, are automatically placed in the public domain and may be shared without legal restriction.