AI Eats Your Work... Literally: News Outlet Sues OpenAI Over Copyright Stripping

13 Aug 2024

The Center for Investigative Reporting Inc. v. OpenAI Court Filing, retrieved on June 27, 2024, is part of HackerNoon’s Legal PDF Series. This part is 5 of 18.

DEFENDANTS’ UNAUTHORIZED USE OF PLAINTIFF’S WORKS IN THEIR TRAINING SETS

46. OpenAI was formed in December 2015 as a “non-profit artificial intelligence research company” but quickly became a multi-billion-dollar for-profit business built on the exploitation of copyrighted works belonging to creators around the world, including CIR. Unlike CIR, OpenAI shed its exclusive nonprofit status just three years after its founding and created OpenAI LP in March 2019, a for-profit company dedicated to its for-profit activities including product development and raising capital from investors.

47. Defendants’ GenAI products utilize a “large language model,” or “LLM.” The different versions of GPT are examples of LLMs. LLMs, including those that power ChatGPT and Copilot, take text prompts as inputs and emit outputs that predict the responses likely to follow, given the potentially billions of input examples used to train them.

48. LLMs arrive at their outputs as the result of their training on works written by humans, which are often protected by copyright. They collect these examples in training sets.

49. When assembling training sets, LLM creators, including Defendants, first identify the works they want to include. They then encode the work in computer memory as numbers called “parameters.”

50. Defendants have not published the contents of the training sets used to train any version of ChatGPT, but have disclosed information about those training sets prior to GPT-4.[3] Beginning with GPT-4, Defendants have been fully secret about the training sets used to train that and later versions of ChatGPT. Plaintiff’s allegations about Defendants’ training sets are therefore based upon an extensive review of publicly available information regarding earlier versions of ChatGPT and consultations with a data scientist employed by Plaintiff’s counsel to analyze that information and provide insights into the manner in which AI is developed and functions.

51. Microsoft has built its own AI product, called Copilot, which uses Microsoft’s Prometheus technology. Prometheus combines the Bing search product with the OpenAI Defendants’ GPT models into a component called Bing Orchestrator. When prompted, Copilot responds to user queries using Bing Orchestrator by providing AI-rewritten abridgements or regurgitations of content found on the internet.[4]

52. Earlier versions of ChatGPT (prior to GPT-4) were trained using at least the following training sets: WebText, WebText2, and sets derived from Common Crawl.

53. WebText and WebText2 were created by the OpenAI Defendants. They are collections of all outbound links on the website Reddit that received at least three “karma.”[5] On Reddit, a karma indicates that users have generally approved the link. The difference between the datasets is that WebText2 involved scraping links from Reddit over a longer period of time. Thus, WebText2 is an expanded version of WebText.
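The selection rule described in paragraph 53 can be illustrated with a minimal sketch. It is not the OpenAI Defendants' code: the `RedditLink` record, the `select_links` helper, and the sample URLs are hypothetical, and the only detail taken from the filing is the threshold of three karma.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedditLink:
    """Hypothetical record for one Reddit post that links to an outside page."""
    url: str    # outbound link posted to Reddit
    karma: int  # net score the post received


MIN_KARMA = 3  # threshold stated in paragraph 53


def select_links(posts, min_karma=MIN_KARMA):
    """Return distinct outbound URLs whose posts meet the karma threshold.

    URLs are returned in first-seen order; a URL posted more than once
    is kept only once.
    """
    seen = set()
    selected = []
    for post in posts:
        if post.karma >= min_karma and post.url not in seen:
            seen.add(post.url)
            selected.append(post.url)
    return selected


# Made-up sample data for illustration only.
sample = [
    RedditLink("https://example.org/a", 5),
    RedditLink("https://example.org/b", 2),  # below the threshold, dropped
    RedditLink("https://example.org/a", 7),  # duplicate URL, kept once
    RedditLink("https://example.org/c", 3),
]
print(select_links(sample))
```

Under this sketch, collecting links over a longer period simply means passing in a larger list of posts, which is consistent with the description of WebText2 as an expanded version of WebText.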

54. The OpenAI Defendants have published a list of the top 1,000 web domains present in the WebText training set and their frequency. According to that list, 16,793 distinct URLs from Mother Jones’s web domain appear in WebText.[6]

55. Defendants have a record, and are aware, of each URL that was included in each of their training sets.

56. Joshua C. Peterson, currently an assistant professor in the Faculty of Computing and Data Sciences at Boston University, and two computational cognitive scientists with PhDs from U.C. Berkeley, created an approximation of the WebText dataset, called OpenWebText, by also scraping outbound links from Reddit that received at least three “karma,” just like the OpenAI Defendants did in creating WebText.[7] They published the results online. A data scientist employed by Plaintiff’s counsel then analyzed those results. OpenWebText contains 17,019 distinct URLs from motherjones.com and 415 from revealnews.org. A list of the Mother Jones works contained in OpenWebText is attached as Exhibit 2. A list of the Reveal works contained in OpenWebText is attached as Exhibit 3.
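A count of distinct URLs per web domain, of the kind reported in paragraphs 54 and 56, could be produced along the following lines. This is a sketch under stated assumptions, not the method the data scientist actually used: the `distinct_urls_by_domain` helper is hypothetical, the sample URLs are made up, and treating `www.` as equivalent to the bare domain is an assumption of this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def distinct_urls_by_domain(urls):
    """Map each domain to the number of distinct URLs seen for it.

    Assumption of this sketch: a leading "www." is stripped so that
    www.example.com and example.com are counted together.
    """
    by_domain = defaultdict(set)
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            by_domain[host].add(url)
    return {domain: len(found) for domain, found in by_domain.items()}


# Made-up URLs for illustration; the paths are not real articles.
sample_urls = [
    "https://www.motherjones.com/politics/a",
    "https://motherjones.com/politics/b",
    "https://www.motherjones.com/politics/a",  # exact duplicate, counted once
    "https://revealnews.org/article/x",
]
counts = distinct_urls_by_domain(sample_urls)
print(counts)
```

Because the count depends on which links had been posted at the time of the scrape, two scrapes taken on different dates would be expected to give somewhat different totals.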

57. Upon information and belief, there are slightly different numbers of Mother Jones articles in WebText and OpenWebText at least in part because the scrapes occurred on different dates.

58. OpenAI has explained that, in developing WebText, it used sets of algorithms called Dragnet and Newspaper to extract text from websites.[8] Upon information and belief, OpenAI used these two extraction methods, rather than one method, to create redundancies in case one method experienced a bug or did not work properly in a given case. Applying two methods rather than one would lead to a training set that is more consistent in the kind of content it contains, which is desirable from a training perspective.

59. Dragnet’s algorithms are designed to “separate the main article content” from other parts of the website, including “footers” and “copyright notices,” and allow the extractor to make further copies only of the “main article content.”[9] Dragnet is also unable to extract author and title information from the header or byline, and extracts it only if it happens to be separately contained in the main article content. Put differently, copies of news articles made by Dragnet are designed not to contain author, title, copyright notices, and footers, and do not contain such information unless it happens to be contained in the main article content.
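The effect described in paragraph 59 can be shown with a simplified stand-in. The sketch below is not Dragnet and does not use Dragnet's algorithms; it assumes, purely for illustration, that the main article content sits inside an `<article>` element and keeps only that text. The `ArticleTextExtractor` class, the `extract_main_content` helper, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect text found inside <article> elements and ignore everything else.

    This is a toy rule standing in for a real content extractor.
    """

    def __init__(self):
        super().__init__()
        self._depth = 0    # how many <article> elements are currently open
        self._chunks = []  # text fragments found inside <article>

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_main_content(html):
    """Return only the text inside <article>, dropping header and footer text."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up page: title and byline in the header, notice and terms in the footer.
PAGE = """
<html><body>
<header><h1>Example Headline</h1><p class="byline">By Jane Doe</p></header>
<article><p>First paragraph of the story.</p><p>Second paragraph.</p></article>
<footer><p>Copyright 2024 Example News. Terms of Use.</p></footer>
</body></html>
"""

extracted = extract_main_content(PAGE)
print(extracted)
```

In this toy page the headline, byline, copyright notice, and terms of use all sit outside the main content, so none of them survive extraction. Had the byline been placed inside the `<article>` element, it would have been carried over, which mirrors the exception noted in the paragraph above.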

60. Like Dragnet, the Newspaper algorithms are incapable of extracting copyright notices and footers. Further, a user of Newspaper has the choice to extract or not extract author and title information. On information and belief, the OpenAI Defendants chose not to extract author and title information because they desired consistency with the Dragnet extractions, and Dragnet is typically unable to extract author and title information.

61. In applying the Dragnet and Newspaper algorithms while assembling the WebText dataset, the OpenAI Defendants removed Plaintiff’s author, title, copyright notice, and terms of use information, the latter of which is contained in the footers of Plaintiff’s websites.

62. Upon information and belief, the OpenAI Defendants, when using Dragnet and Newspaper, first download and save the relevant webpage before extracting data from it. This is at least because, when they use Dragnet and Newspaper, they likely anticipate a possible future need to regenerate the dataset (e.g., if the dataset becomes corrupted), and it is cheaper to save a copy than it is to recrawl all the data.

63. Because, by the time of its scraping, Dragnet and Newspaper were publicly known to remove author, title, copyright notices, and footers, and given that OpenAI employs highly skilled data scientists who would know how Dragnet and Newspaper work, the OpenAI Defendants intentionally and knowingly removed this copyright management information while assembling WebText.

64. A data scientist employed by Plaintiff’s counsel applied the Dragnet code to three Reveal URLs contained in OpenWebText. The results are attached as Exhibit 4. The resulting copies, whose text is substantively identical to the original (e.g., identical except for the seemingly random addition of an extra space between two words, or the exclusion of a description associated with an embedded photo), lack the author, title, copyright notice, and terms of use information with which they were conveyed to the public, except in some cases where author information happened to be contained in the main article content. The Dragnet code failed when the data scientist attempted to apply it to Mother Jones articles, further corroborating the OpenAI Defendants’ need for redundancies referenced above.
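A comparison of the kind described in paragraph 64 can be sketched as follows. This is an illustration under stated assumptions, not the data scientist's actual procedure: "substantively identical" is approximated here as equality after collapsing whitespace, and the `normalize`, `substantively_identical`, and `missing_metadata` helpers, along with the sample strings, are hypothetical.

```python
import re


def normalize(text):
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def substantively_identical(original, extracted):
    """Assumed test for this sketch: equal once whitespace is normalized."""
    return normalize(original) == normalize(extracted)


def missing_metadata(extracted, metadata):
    """Return the names of metadata fields whose values do not appear in the copy."""
    body = normalize(extracted)
    return [name for name, value in metadata.items() if normalize(value) not in body]


# Made-up example: the copy differs only by an extra space between two words.
original_body = "The agency declined to comment on the report."
extracted_body = "The agency declined  to comment on the report."

# Made-up metadata that accompanied the original page.
metadata = {
    "author": "Jane Doe",
    "title": "Example Headline",
    "copyright notice": "Copyright 2024 Example News",
}

same_text = substantively_identical(original_body, extracted_body)
absent = missing_metadata(extracted_body, metadata)
print(same_text, absent)
```

A stricter or looser notion of "substantively identical" (for example, one that tolerates an omitted photo description) would need a different comparison; whitespace normalization only covers the stray-space case.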

65. A data scientist employed by Plaintiff’s counsel also applied the Newspaper code to three Mother Jones and three Reveal URLs contained in OpenWebText. The data scientist applied the version of the code that enables the user not to extract author and title information based on the reasonable assumption that the OpenAI Defendants desired consistency with the Dragnet extractions. The results are attached as Exhibit 5. The resulting copies, whose text is substantively identical to the original, lack the author, title, copyright notice, and terms of use information with which they were conveyed to the public, except in some cases where author information happened to be contained in the main article content.

66. The absence of author, title, copyright notice, and terms of use information from the copies of Plaintiff’s articles generated by applying the Dragnet and Newspaper codes—codes OpenAI has admitted to have intentionally used when assembling WebText—further corroborates that the OpenAI Defendants intentionally removed author, title, copyright notice, and terms of use information from Plaintiff’s copyright-protected news articles.

67. Upon information and belief, the OpenAI Defendants have continued to use the same or similar Dragnet and Newspaper text extraction methods when creating training sets for every version of ChatGPT since GPT-2. This is at least because the OpenAI Defendants have admitted to using these methods for GPT-2 and have neither publicly disclaimed their use for later versions of ChatGPT nor publicly claimed to have used any other text extraction methods for those later versions.

68. The other repository the OpenAI Defendants have admitted to using, Common Crawl, is a scrape of most of the internet created by a third party.

69. To train GPT-2, OpenAI downloaded Common Crawl data from the third party’s website and filtered it to include only certain works, such as those written in English.[10]

70. Google has published instructions on how to replicate a dataset called C4, a monthly snapshot of filtered Common Crawl data that Google used to train its own AI models. Upon information and belief, based on the similarity of Defendants’ and Google’s goals in training AI models, C4 is substantially similar to the filtered versions of Common Crawl used to train ChatGPT. The Allen Institute for AI, a nonprofit research institute launched by Microsoft cofounder Paul Allen, followed Google’s instructions and published its recreation of C4 online.[11]

71. A data scientist employed by Plaintiff’s counsel analyzed this recreation. It contains 26,178 URLs originating from motherjones.com. The vast majority of these URLs contain Plaintiff’s copyright-protected news articles. None contain terms of use information. None contain copyright notice information as to Plaintiff’s copyright-protected news articles. The majority also lack author and title information. In some cases, the articles are substantively identical, while in others a small number of paragraphs are omitted.

72. This recreation also contains 451 articles originating from revealnews.org. The vast majority of these URLs contain Plaintiff’s copyright-protected news articles. None of the news articles contains copyright notice or terms of use information. The majority also lack author and title information. In some cases, the articles are substantively identical, while in others a small number of paragraphs is omitted.

73. As a representative sample, the text of three Mother Jones and three Reveal articles as they appear in the C4 set is attached as Exhibit 6. None of these articles contains the author, title, copyright notice, or terms of use information with which they were conveyed to the public.

74. Plaintiff has not licensed or otherwise permitted Defendants to include any of its works in their training sets.

75. Downloading tens of thousands of Plaintiff’s articles without permission infringes Plaintiff’s copyrights, more specifically, the right to control reproductions of copyright-protected works.



About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case, retrieved on June 27, 2024, from motherjones.com, is part of the public domain. The court-created documents are works of the federal government, and under copyright law, are automatically placed in the public domain and may be shared without legal restriction.

[3] Plaintiff collectively refers to all versions of ChatGPT as “ChatGPT” unless a specific version is specified.

[4] https://blogs.bing.com/search-quality-insights/february-2023/Building-the-New-Bing

[5] Alec Radford et al., Language Models are Unsupervised Multitask Learners, 3 https://cdn.openai.com/better-languagemodels/language_models_are_unsupervised_multitask_learners.pdf.

[6] https://github.com/openai/gpt-2/blob/master/domains.txt.

[7] https://github.com/jcpeterson/openwebtext/blob/master/README.md.

[8] Alec Radford et al., Language Models are Unsupervised Multitask Learners, 3 https://cdn.openai.com/better-languagemodels/language_models_are_unsupervised_multitask_learners.pdf.

[9] Matt McDonnell, Benchmarking Python Content Extraction Algorithms (Jan. 29, 2015), https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnetreadability-goose-and-eatiht.

[10] Tom B. Brown et al., Language Models are Few-Shot Learners, 14 (July 22, 2020), https://arxiv.org/pdf/2005.14165.

[11] https://huggingface.co/datasets/allenai/c4.