OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us

Last updated: January 29, 2025 2:49 pm

The narrative that OpenAI, Microsoft, and freshly minted White House “AI czar” David Sacks are now pushing to explain why DeepSeek was able to create a large language model that outpaces OpenAI’s while spending orders of magnitude less money and using older chips is that DeepSeek used OpenAI’s data unfairly and without compensation. Sound familiar?

Both Bloomberg and the Financial Times are reporting that Microsoft and OpenAI have been probing whether DeepSeek improperly trained R1, the model now taking the AI world by storm, on the outputs of OpenAI's models.

Here is how the Bloomberg article begins: “Microsoft Corp. and OpenAI are investigating whether data output from OpenAI’s technology was obtained in an unauthorized manner by a group linked to Chinese artificial intelligence startup DeepSeek, according to people familiar with the matter.” The story goes on to say that “Such activity could violate OpenAI’s terms of service or could indicate the group acted to remove OpenAI’s restrictions on how much data they could obtain, the people said.”

The venture capitalist and new Trump administration member David Sacks, meanwhile, said that there is “substantial evidence” that DeepSeek “distilled the knowledge out of OpenAI’s models.” 

“There’s a technique in AI called distillation, which you’re going to hear a lot about, and it’s when one model learns from another model, effectively what happens is that the student model asks the parent model a lot of questions, just like a human would learn, but AIs can do this asking millions of questions, and they can essentially mimic the reasoning process they learn from the parent model and they can kind of suck the knowledge of the parent model,” Sacks told Fox News. “There’s substantial evidence that what DeepSeek did here is they distilled the knowledge out of OpenAI’s models and I don’t think OpenAI is very happy about this.” 

I will explain what this means in a moment, but first: Hahahahahahahahahahahahahahahahahahahhahahahahahahahahahahaha. It is, as many have already pointed out, incredibly ironic that OpenAI, a company that has been obtaining large amounts of data from all of humankind largely in an “unauthorized manner,” and, in some cases, in violation of the terms of service of those it has been taking from, is now complaining about the very practices by which it has built its company.

The argument made by OpenAI, and by every artificial intelligence company that has been sued for surreptitiously and indiscriminately sucking up whatever data it can find on the internet, is not that they are not sucking up all of this data; it is that they are sucking up this data and that they are allowed to do so.

OpenAI is currently being sued by the New York Times for training on its articles, and its argument is that this is perfectly fine under copyright law fair use protections.

“Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness,” OpenAI wrote in a blog post. In its motion to dismiss in court, OpenAI wrote “it has long been clear that the non-consumptive use of copyrighted material (like large language model training) is protected by fair use.” 

OpenAI and Microsoft are now essentially whining about being beaten at their own game by DeepSeek. But additionally, part of OpenAI’s argument in the New York Times case is that the only way to make a generalist large language model that performs well is by sucking up gigantic amounts of data. It tells the court that it needs a huge amount of data to make a generalist language model, meaning any one source of data is not that important. This is funny, because DeepSeek managed to make a large language model that rivals and outpaces OpenAI’s own without falling into the more-data-equals-better-model trap. Instead, DeepSeek used a reinforcement learning strategy that its paper claims is far more efficient than what we’ve seen from other AI companies.

OpenAI’s motion to dismiss the New York Times lawsuit states as part of its argument that “the key to generalist language models” is “scale,” meaning that part of its argument is that any individual piece of stolen content cannot make a large language model, and that what allows OpenAI to make industry-leading large language models is this idea of scale. OpenAI’s lawyers quote from a New York Times article about this strategy as part of their argument: “The amount of data needed was staggering” to create GPT-3, they wrote. “It was that ‘unprecedented scale’ that allowed the model to internalize not only a ‘map of human language,’ but achieve a level of adaptability—and ‘emergent’ intelligence—that ‘no one thought possible.’”

As Sacks mentioned, “distillation” is an established principle in artificial intelligence research, and it’s something that is done all the time to refine and improve the accuracy of smaller large language models. This process is so normalized in deep learning that the most often cited paper about it was coauthored by Geoffrey Hinton, part of a body of work that just earned him the Nobel Prize. Hinton’s paper specifically suggests that distillation is a way to make large language models more efficient, and that “distilling works very well for transferring knowledge from an ensemble or from a large highly regularized model into a smaller, distilled model.”

An IBM article on distillation notes: “The LLMs with the highest capabilities are, in most cases, too costly and computationally demanding to be accessible to many would-be users like hobbyists, startups or research institutions … knowledge distillation has emerged as an important means of transferring the advanced capabilities of large, often proprietary models to smaller, often open-source models. As such, it has become an important tool in the democratization of generative AI.”
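The mechanics Sacks and Hinton describe boil down to a simple training objective: the student model is pushed to match the teacher’s temperature-softened output distribution, not just its top answer. Here is a minimal, illustrative sketch of that loss in plain Python; real LLM distillation applies this over token distributions at enormous scale, and the function names here are my own, not from any paper or library:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax: higher T flattens the distribution,
    exposing more of the teacher's 'dark knowledge' about wrong answers."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by
    T^2 as in Hinton-style knowledge distillation. Minimizing this trains
    the student to mimic the teacher's full output distribution."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T

teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, teacher))              # matching student: 0.0
print(distillation_loss([0.1, 1.0, 2.0], teacher) > 0)  # mismatched student: True
```

In practice a student querying a deployed model only sees sampled outputs, not logits, so distillation-via-API is noisier than this, but the principle is the same: the student learns from the teacher’s answers.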

In late December, OpenAI CEO Sam Altman took what many people saw as a veiled shot at DeepSeek, immediately after the release of DeepSeek V3, an earlier DeepSeek model. “It is (relatively) easy to copy something that you know works,” Altman tweeted. “It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.” 

“It’s also extremely hard to rally a big talented research team to charge a new hill in the fog together,” he added. “This is the key to driving progress forward.”

Even this is ridiculous, though. Besides being trained on huge amounts of other people’s data, OpenAI’s work builds on research pioneered by Google, which itself builds on earlier academic research. This is, simply, how artificial intelligence research (and scientific research more broadly) works. 

This is all to say that, if OpenAI argues that it is legal for the company to train on whatever it wants for whatever reason it wants, then it stands to reason that it doesn’t have much of a leg to stand on when competitors use strategies that are common in the world of machine learning to make their own models. But of course, it is going with the argument that it must “protect [its] IP.”

“We know PRC based companies — and others — are constantly trying to distill the models of leading US AI companies,” an OpenAI spokesperson told Bloomberg. “As the leading builder of AI, we engage in countermeasures to protect our IP, including a careful process for which frontier capabilities to include in released models, and believe as we go forward that it is critically important that we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.”

About the author

Jason Koebler is a cofounder of 404 Media. He was previously the editor-in-chief of Motherboard. He loves the Freedom of Information Act and surfing.
