{"id":92919,"date":"2025-04-02T03:19:48","date_gmt":"2025-04-02T03:19:48","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/04\/02\/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books\/"},"modified":"2025-04-02T03:19:48","modified_gmt":"2025-04-02T03:19:48","slug":"researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/04\/02\/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books\/","title":{"rendered":"Researchers suggest OpenAI trained AI models on paywalled O&#8217;Reilly books"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">OpenAI has been <a href=\"https:\/\/www.theregister.com\/2024\/01\/12\/github_copilot_copyright_case_narrowed\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">accused<\/a> by <a rel=\"nofollow\" href=\"https:\/\/www.theregister.com\/2024\/01\/12\/github_copilot_copyright_case_narrowed\/\">many<\/a> parties of training its AI on copyrighted content sans permission. Now a new <a href=\"https:\/\/ssrc-static.s3.us-east-1.amazonaws.com\/OpenAI-Training-Violations-OReillyBooks_Sruly-OReilly-Strauss_SSRC_04012025.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">paper<\/a> by an AI watchdog organization makes the serious accusation that the company increasingly relied on nonpublic books it didn\u2019t license to train more sophisticated AI models.<\/p>\n<p class=\"wp-block-paragraph\">AI models are essentially complex prediction engines. Trained on a lot of data \u2014 books, movies, TV shows, and so on \u2014 they learn patterns and novel ways to extrapolate from a simple prompt. When a model \u201cwrites\u201d an essay on a Greek tragedy or \u201cdraws\u201d Ghibli-style images, it\u2019s simply pulling from its vast knowledge to approximate. 
It isn\u2019t arriving at anything new.<\/p>\n<p class=\"wp-block-paragraph\">While a number of AI labs, including OpenAI, have begun embracing AI-generated data to train AI as they exhaust real-world sources (mainly the public web), few have eschewed real-world data entirely. That\u2019s likely because training on purely synthetic data comes with risks, like worsening a model\u2019s performance.<\/p>\n<p class=\"wp-block-paragraph\">The new paper, out of the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O\u2019Reilly and economist Ilan Strauss, draws the conclusion that OpenAI likely trained its <a href=\"https:\/\/techcrunch.com\/2024\/05\/13\/openais-newest-model-is-gpt-4o\/\">GPT-4o<\/a> model on paywalled books from O\u2019Reilly Media. (O\u2019Reilly is the CEO of O\u2019Reilly Media.)<\/p>\n<p class=\"wp-block-paragraph\">In <a href=\"https:\/\/techcrunch.com\/tag\/chatgpt\/\">ChatGPT<\/a>, GPT-4o is the default model. O\u2019Reilly doesn\u2019t have a licensing agreement with OpenAI, the paper says.<\/p>\n<p class=\"wp-block-paragraph\">\u201cGPT-4o, OpenAI\u2019s more recent and capable model, demonstrates strong recognition of paywalled O\u2019Reilly book content\u00a0\u2026 compared to OpenAI\u2019s earlier model GPT-3.5 Turbo,\u201d wrote the co-authors of the paper. \u201cIn contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O\u2019Reilly book samples.\u201d<\/p>\n<p class=\"wp-block-paragraph\">The paper used a method called <a href=\"https:\/\/arxiv.org\/pdf\/2402.09910\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">DE-COP<\/a>, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models\u2019 training data. Also known as a \u201cmembership inference attack,\u201d the method tests whether a model can reliably distinguish human-authored texts from paraphrased, AI-generated versions of the same text. 
If it can, it suggests that the model might have prior knowledge of the text from its training data.<\/p>\n<p class=\"wp-block-paragraph\">The co-authors of the paper \u2014 O\u2019Reilly, Strauss, and AI researcher Sruly Rosenblat \u2014 say that they probed GPT-4o, <a href=\"https:\/\/techcrunch.com\/2023\/08\/22\/openai-brings-fine-tuning-to-gpt-3-5-turbo\/\">GPT-3.5 Turbo<\/a>, and other OpenAI models\u2019 knowledge of O\u2019Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O\u2019Reilly books to estimate the probability that a particular excerpt had been included in a model\u2019s training dataset.<\/p>\n<p class=\"wp-block-paragraph\">According to the results of the paper, GPT-4o \u201crecognized\u201d far more paywalled O\u2019Reilly book content than OpenAI\u2019s older models, including GPT-3.5 Turbo. That\u2019s even after accounting for potential confounding factors, the authors said, like improvements in newer models\u2019 ability to figure out whether text was human-authored.<\/p>\n<p class=\"wp-block-paragraph\">\u201cGPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O\u2019Reilly books published prior to its training cutoff date,\u201d wrote the co-authors. <\/p>\n<p class=\"wp-block-paragraph\">It isn\u2019t a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn\u2019t foolproof and that OpenAI might\u2019ve collected the paywalled book excerpts from users copying and pasting them into ChatGPT.<\/p>\n<p class=\"wp-block-paragraph\">Muddying the waters further, the co-authors didn\u2019t evaluate OpenAI\u2019s most recent collection of models, which includes GPT-4.5 and \u201creasoning\u201d models such as o3-mini and o1. 
It\u2019s possible that these models weren\u2019t trained on paywalled O\u2019Reilly book data or were trained on a lesser amount than GPT-4o.<\/p>\n<p class=\"wp-block-paragraph\">That being said, it\u2019s no secret that OpenAI, which has advocated for <a href=\"https:\/\/techcrunch.com\/2025\/03\/13\/openai-calls-for-u-s-government-to-codify-fair-use-for-ai-training\/\">looser restrictions<\/a> around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone so far as to <a rel=\"nofollow\" href=\"https:\/\/www.niemanlab.org\/2025\/02\/meet-the-journalists-training-ai-models-for-meta-and-openai\/\">hire journalists to help fine-tune its models\u2019 outputs<\/a>. That\u2019s a trend across the broader industry: AI companies recruiting experts in domains like science and physics to <a href=\"https:\/\/www.theinformation.com\/articles\/why-a-14-billion-startup-is-now-hiring-phds-to-train-ai-from-their-living-rooms\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">effectively have these experts feed their knowledge into AI systems<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms \u2014 <a href=\"https:\/\/www.businessinsider.com\/openai-dalle-opt-out-process-artists-enraging-2023-9\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">albeit imperfect ones<\/a> \u2014 that allow copyright owners to flag content they\u2019d prefer the company not use for training purposes.<\/p>\n<p class=\"wp-block-paragraph\">Still, as OpenAI battles several suits over its training data practices and treatment of copyright law in U.S. 
courts, the O\u2019Reilly paper isn\u2019t the most flattering look.<\/p>\n<p class=\"wp-block-paragraph\">OpenAI didn\u2019t respond to a request for comment.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/04\/01\/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>OpenAI has been accused by many parties of training its AI on copyrighted content sans permission. Now a new paper by an AI watchdog organization<\/p>\n","protected":false},"author":1,"featured_media":92920,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[178],"tags":[],"class_list":["post-92919","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/92919","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=92919"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/92919\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/92920"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=92919"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=92919"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=92919"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}