ChatGPT's Predecessor Could Translate French Even Though It Was Not Trained to Do So

6 Mar 2024

The New York Times Company v. OpenAI Update Court Filing, retrieved on February 26, 2024, is part of HackerNoon’s Legal PDF Series. This part is 2 of 15.

II. BACKGROUND

A. OpenAI’s Pioneering Research

OpenAI was founded in 2015 to “advance digital intelligence in the way that is most likely to benefit humanity as a whole.” Compl. ¶ 56. It entered the field of “natural language processing” (NLP), which includes the development of statistical tools called “language models.”[15] These models can “predict[] words that are likely to follow a given string of text” based on statistics derived from a body of text—much like a weather model can predict the rain using statistics derived from historical weather data. Compl. ¶ 75. By 2015, research had already unlocked “substantial progress” on “tasks such as reading comprehension” and “question answering.”[16]
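To make the idea of predicting likely next words from text statistics concrete, here is a minimal sketch of a toy "bigram" model in Python. It is an illustration only and not the method described in the filing or used by OpenAI: the function names `train_bigram_counts` and `predict_next` and the sample corpus are hypothetical, and real language models are far larger and use neural networks rather than simple word counts.

```python
from collections import Counter, defaultdict


def train_bigram_counts(text):
    """Count, for each word in the text, which words follow it and how often."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word, k=3):
    """Return up to k candidate next words with their estimated probabilities."""
    counts = following.get(word.lower())
    if not counts:
        # The word never appeared with a successor, so there is nothing to predict.
        return []
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]


# Hypothetical toy corpus, chosen only for illustration.
corpus = "the rain fell and the rain stopped and the sun rose"
model = train_bigram_counts(corpus)
print(predict_next(model, "the"))
```

In this toy corpus, "rain" follows "the" more often than "sun" does, so the sketch ranks "rain" first. The analogy to a weather model holds in the same way: the prediction is a frequency estimate drawn from past observations.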

Those early models, however, were “brittle” and “narrow.”[17] Researchers built them by “manually creat[ing] and label[ling]” datasets to “demonstrat[e] correct behavior”—like sets of English-to-French text translations—and using that data to “train a system to imitate [that] behavior[].” GPT-2 Paper at 1, 3. The resulting models, while impressive, could only carry out the specific tasks demonstrated by the training data. Id.; GPT-3 Paper at 3 (“need for task-specific datasets” was “a major limitation”). “To be broadly useful” to ordinary people, language models needed the ability to “seamlessly mix together or switch between many tasks and skills” without being specifically trained to carry out each task. GPT-3 Paper at 4. In other words, the models needed to be “competent generalists,” not “narrow experts.” GPT-2 Paper at 1.

OpenAI’s researchers set out to solve that complex, scientific problem. In 2019, they posited that the way to build more capable, generalist models was to use “as large and diverse a dataset as possible [] to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.” GPT-2 Paper at 3. The hypothesis was that “[]training at a large enough scale [might] offer a ‘natural’ broad distribution of tasks implicitly contained in predicting the text itself.” GPT-3 Paper at 40. So instead of training its models “on a single domain of text,” OpenAI chose to use a richer and more diverse source: the Internet. GPT-2 Paper at 3.

OpenAI’s researchers identified text from webpages whose URLs had been publicly shared on a social media platform. Id. This became a dataset called “WebText,” which OpenAI used to train a model called “GPT-2.” Id.; see also Compl. ¶ 85. WebText contained a wide array of text from internet forums, restaurant reviews, recipe websites, blogs, shopping websites, dictionaries, medical websites, how-to pages, and more.[18] The dataset was so diverse that even though Times content represented only a tiny fraction of the data, the “NYTimes” was one of the “top 15 domains by volume” in the collection. See GPT-2 Model Card. This happened not because OpenAI believed Times articles are more “valu[able]” than other content, contra Compl. ¶ 2 (suggesting OpenAI intentionally “gave Times content particular emphasis”), but because of the frequency with which certain social media users shared links to the Times’s content, see GPT-2 Paper at 3.

The results of this sophisticated research were impressive. The GPT-2 model proved able to answer trivia questions and perform higher-function tasks like “resolv[ing] ambiguities in text.” GPT-2 Paper at 6–7. The model even showed a “surprising” ability to translate French to English, even though OpenAI had “deliberately removed non-English webpages” from the training dataset. Id. at 7. These research results were “exciting” not only because of the model’s capability, but because they scientifically confirmed that the ability to “perform commonsense reasoning” increased dramatically with the size and diversity of the training data. Id. at 6 (Figure 3).



[15] Sébastien Bubeck, et al., Sparks of Artificial General Intelligence: Early Experiments with GPT-4 at 4, 98 (Apr. 13, 2023), https://arxiv.org/pdf/2303.12712.pdf (“Bubeck Paper”); Compl. ¶¶ 71, 91 nn.9 & 24 (citing articles). By “refer[ring] [to these documents] in [its] complaint,” the Times incorporated them by reference. DiFolco v. MSNBC Cable L.L.C., 622 F.3d 104, 111–12 (2d Cir. 2010).

[16] OpenAI, Language Models are Few-Shot Learners at 3 (July 22, 2020), https://arxiv.org/pdf/2005.14165.pdf (“GPT-3 Paper”); see also Compl. ¶¶ 86, 90 & nn.18, 22 (citing and quoting this paper).

[17] OpenAI, Language Models are Unsupervised Multitask Learners at 1 (Feb. 14, 2019), https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (“GPT-2 Paper”); see also Compl. ¶ 85 n.15 (citing and quoting this paper).

[18] See OpenAI, GPT-2 Model Card, GitHub, https://github.com/openai/gpt-2/blob/master/model_card.md (last updated Nov. 2019) (“GPT-2 Model Card”); see also Compl. ¶ 85 nn.14, 16, 17 (citing and quoting this source).


About HackerNoon Legal PDF Series: We bring you the most important technical and insightful public domain court case filings.

This court case, retrieved on February 26, 2024, from fingfx.thomsonreuters.com, is part of the public domain. The court-created documents are works of the federal government and, under copyright law, are automatically placed in the public domain and may be shared without legal restriction.