{"id":91295,"date":"2025-02-22T02:30:30","date_gmt":"2025-02-22T02:30:30","guid":{"rendered":"https:\/\/neclink.com\/index.php\/2025\/02\/22\/court-filings-show-meta-staffers-discussed-using-copyrighted-content-for-ai-training\/"},"modified":"2025-02-22T02:30:30","modified_gmt":"2025-02-22T02:30:30","slug":"court-filings-show-meta-staffers-discussed-using-copyrighted-content-for-ai-training","status":"publish","type":"post","link":"https:\/\/neclink.com\/index.php\/2025\/02\/22\/court-filings-show-meta-staffers-discussed-using-copyrighted-content-for-ai-training\/","title":{"rendered":"Court filings show Meta staffers discussed using copyrighted content for AI training"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company\u2019s AI models, according to court documents unsealed on Thursday. <\/p>\n<p class=\"wp-block-paragraph\">The documents were submitted by plaintiffs in the case Kadrey v. Meta, one of many AI copyright disputes slowly winding through the U.S. court system. The defendant, Meta, claims that training models on IP-protected works, particularly books, is \u201cfair use.\u201d The plaintiffs, who include authors Sarah Silverman and Ta-Nehisi Coates, disagree. <\/p>\n<p class=\"wp-block-paragraph\">Previous materials submitted in the suit alleged that Meta CEO Mark Zuckerberg\u00a0<a href=\"https:\/\/techcrunch.com\/2025\/01\/09\/mark-zuckerberg-gave-metas-llama-team-the-ok-to-train-on-copyrighted-works-filing-claims\/\" target=\"_blank\" rel=\"noreferrer noopener\">gave Meta\u2019s AI team the OK to train on copyrighted\u00a0content<\/a> and that <a href=\"https:\/\/techcrunch.com\/2025\/02\/14\/court-filings-show-meta-paused-efforts-to-license-books-for-ai-training\/\">Meta halted AI training data licensing talks with book publishers<\/a>. 
But the new filings, most of which show portions of internal work chats between Meta staffers, paint the clearest picture yet of how Meta may have come to use copyrighted data to train its models, including models in the company\u2019s <a href=\"https:\/\/techcrunch.com\/2024\/09\/08\/meta-llama-everything-you-need-to-know-about-the-open-generative-ai-model\/\">Llama family<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">In one chat, Meta employees, including Melanie Kambadur, a senior manager for Meta\u2019s Llama model research team, discussed training models on works they knew may be legally fraught.<\/p>\n<p class=\"wp-block-paragraph\">\u201c[M]y opinion would be (in the line of \u2018ask forgiveness, not for permission\u2019): we try to acquire the books and escalate it to execs so they make the call,\u201d wrote Xavier Martinet, a Meta research engineer, in a chat dated February 2023, <a href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.449.4.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">according to the filings<\/a>. \u201c[T]his is why they set up this gen ai org for [sic]: so we can be less risk averse.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Martinet floated the idea of buying e-books at retail prices to build a training set rather than cutting licensing deals with individual book publishers. 
After another staffer pointed out that using unauthorized, copyrighted materials might be grounds for a legal challenge, Martinet doubled down, arguing that \u201ca gazillion\u201d startups were probably already using pirated books for training.<\/p>\n<p class=\"wp-block-paragraph\">\u201cI mean, worst case: we found out it is finally ok, while a gazillion start up [sic] just pirated tons of books on bittorrent,\u201d Martinet wrote, <a href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.449.4.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">according to the filings<\/a>. \u201c[M]y 2 cents again: trying to have deals with publishers directly takes a long time\u00a0\u2026\u201d<\/p>\n<p class=\"wp-block-paragraph\">In the same chat, Kambadur, who noted Meta was in talks with document hosting platform Scribd \u201cand others\u201d for licenses, cautioned that while using \u201cpublicly available data\u201d for model training would require approvals, Meta\u2019s lawyers were being \u201cless conservative\u201d than they had been in the past with such approvals. <\/p>\n<p class=\"wp-block-paragraph\">\u201cYeah we definitely need to get licenses or approvals on publicly available data still,\u201d Kambadur said, <a href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.449.4.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">according to the filings<\/a>. 
\u201c[D]ifference now is we have more money, more lawyers, more bizdev help, ability to fast track\/escalate for speed, and lawyers are being a bit less conservative on approvals.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-talks-of-libgen\">Talks of Libgen<\/h2>\n<p class=\"wp-block-paragraph\">In another work chat relayed in the filings, Kambadur discusses possibly using Libgen, a \u201clinks aggregator\u201d that provides access to copyrighted works from publishers, as an alternative to data sources that Meta might license. <\/p>\n<p class=\"wp-block-paragraph\">Libgen has been sued a number of times, ordered to shut down, and fined tens of millions of dollars for copyright infringement. One of Kambadur\u2019s colleagues <a href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.449.8.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">responded with a screenshot<\/a> of a Google Search result for Libgen containing the snippet \u201cNo, Libgen is not legal.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Some decision-makers within Meta appear to have been under the impression that failing to use Libgen for model training could seriously hurt Meta\u2019s competitiveness in the AI race, <a href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.449.13.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">according to the filings<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">In an email addressed to Meta AI VP Joelle Pineau, Sony Theakanath, director of product management at Meta, called Libgen \u201cessential to meet SOTA numbers across all categories,\u201d referring to topping the best, state-of-the-art (SOTA) AI models and benchmark categories.<\/p>\n<p class=\"wp-block-paragraph\">Theakanath also outlined \u201cmitigations\u201d in the email intended to help reduce Meta\u2019s legal exposure, including removing data from Libgen \u201cclearly marked as 
pirated\/stolen\u201d and also simply not publicly citing usage. \u201cWe would not disclose use of Libgen datasets used to train,\u201d as Theakanath put it.<\/p>\n<p class=\"wp-block-paragraph\">In practice, these mitigations entailed combing through Libgen files for words like \u201cstolen\u201d or \u201cpirated,\u201d <a href=\"https:\/\/x.com\/jason_kint\/status\/1892977866834420103\/photo\/1\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">according to the filings<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">In a <a href=\"https:\/\/x.com\/jason_kint\/status\/1892977924912935097\/photo\/1\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">work chat<\/a>, Kambadur <a href=\"https:\/\/x.com\/jason_kint\/status\/1892978406817497285\/photo\/1\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">mentioned<\/a> that Meta\u2019s AI team also tuned models to \u201cavoid IP risky prompts\u201d \u2014 that is, configured the models to refuse to answer questions like \u201creproduce the first three pages of \u2018Harry Potter and the Sorcerer\u2019s Stone\u2019\u201d or \u201ctell me which e-books you were trained on.\u201d<\/p>\n<p class=\"wp-block-paragraph\">The filings contain other revelations, implying that Meta <a href=\"https:\/\/x.com\/jason_kint\/status\/1892978411649282295\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">may have scraped Reddit data<\/a> for some type of model training, possibly by mimicking the behavior of a third-party app called <a href=\"https:\/\/pushshift.io\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Pushshift<\/a>. 
Notably, Reddit <a href=\"https:\/\/techcrunch.com\/2023\/04\/18\/reddit-will-begin-charging-for-access-to-its-api\/\">said<\/a> in April 2023 that it planned to begin charging AI companies to access data for model training.<\/p>\n<p class=\"wp-block-paragraph\">In <a rel=\"nofollow\" href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.449.14.pdf\">one chat dated March 2024<\/a>, Chaya Nayak, director of product management at Meta\u2019s generative AI org, said that Meta leadership was considering \u201coverriding\u201d past decisions on training sets, including a decision not to use Quora content or licensed books and scientific articles, to ensure the company\u2019s models had sufficient training data.<\/p>\n<p class=\"wp-block-paragraph\">Nayak implied that Meta\u2019s first-party training datasets \u2014 Facebook and Instagram posts, text transcribed from videos on Meta platforms, and certain <a href=\"https:\/\/business.facebook.com\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Meta for Business<\/a> messages \u2014 simply weren\u2019t enough. \u201c[W]e need more data,\u201d she wrote.<\/p>\n<p class=\"wp-block-paragraph\">The plaintiffs in Kadrey v. Meta have amended their complaint several times since the case was filed in the U.S. District Court for the Northern District of California, San Francisco Division, in 2023. 
The latest alleges, among other claims, that Meta cross-referenced certain pirated books with copyrighted books available for license to determine whether it made sense to pursue a licensing agreement with a publisher.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">In a sign of how high Meta considers the legal stakes to be, the company <a href=\"https:\/\/chatgptiseatingtheworld.com\/2025\/02\/21\/meta-adds-legal-firepower-to-its-ai-defense-supreme-court-litigators-kannon-shanmugam-william-t-marks-of-paul-weiss\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">has added<\/a> two Supreme Court litigators from the law firm Paul Weiss to its defense team on the case.<\/p>\n<p class=\"wp-block-paragraph\">Meta didn\u2019t immediately respond to a request for comment.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/02\/21\/court-filings-show-meta-staffers-discussed-using-copyrighted-content-for-ai-training\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>For years, Meta employees have internally discussed using copyrighted works obtained through legally questionable means to train the company\u2019s AI models, according to court 
documents<\/p>\n","protected":false},"author":1,"featured_media":91296,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[149],"tags":[],"class_list":["post-91295","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91295","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/comments?post=91295"}],"version-history":[{"count":0,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/posts\/91295\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media\/91296"}],"wp:attachment":[{"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/media?parent=91295"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/categories?post=91295"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/neclink.com\/index.php\/wp-json\/wp\/v2\/tags?post=91295"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}