Microsoft and OpenAI are reportedly investigating whether Chinese AI startup DeepSeek improperly accessed and utilised data from OpenAI’s models to develop its own AI system. This investigation centres on the technique known as “distillation”, where a smaller model is trained using the outputs of a larger, more advanced model.
According to the report, such unauthorised use could violate OpenAI’s terms of service and raise concerns about intellectual property theft. Many in the industry have called out the irony in this development, since OpenAI itself trained its models on the internet without obtaining permission from authors and creators.
David Sacks, Trump’s AI and crypto advisor, appeared on Fox News on Tuesday and said there is significant evidence that the Chinese AI firm DeepSeek used a technique called “distillation” to extract knowledge from OpenAI’s models. Sacks likened this process to theft.
This development is surprising, considering that Microsoft recently announced it is making DeepSeek R1 available on Azure AI Foundry and the GitHub model catalogue, expanding the platform’s AI portfolio.
“Customers will soon be able to run DeepSeek R1’s distilled models locally on Copilot+ PCs, as well as on the vast ecosystem of GPUs available on Windows,” said Microsoft chief Satya Nadella.
He further added that DeepSeek had introduced real innovations, some mirroring what OpenAI discovered with o1. “Now, of course, those innovations are becoming commoditised and will be widely used,” he said.
Is DeepSeek in Trouble?
“OpenAI scrapes the internet and trains a model with everyone’s data with impunity and without asking for permission — All good. DeepSeek distils OpenAI models to train their own — outrageous! You gotta have balls to consider this ‘proprietary data’,” said Santiago Valdarrama, founder of Tideily.
With DeepSeek’s rising popularity and its open-source nature, many are touting it as the “Robinhood of AI”.
Pratik Desai, founder of KissanAI, told AIM that DeepSeek has earned the moniker because it has returned ‘stolen’ public data to the public through open-source models. He explained that distillation is a common machine learning technique that transfers the knowledge of a large pre-trained ‘teacher’ model to a smaller ‘student’ model.
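The teacher–student transfer Desai describes is usually done by training the student to match the teacher’s softened output distribution. A minimal sketch of that loss, using hypothetical toy logits rather than any real model’s outputs, might look like this:

```python
# Minimal sketch of knowledge distillation (illustrative only; not
# DeepSeek's or OpenAI's actual training code, and all values are toy data).
import numpy as np

def softmax(logits, temperature=1.0):
    # Divide by the temperature to soften the distribution, then
    # subtract the max for numerical stability before exponentiating.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    A higher temperature exposes the teacher's knowledge about the
    relative similarity of classes, not just its top prediction.
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    # KL(p || q), averaged over the batch and scaled by T^2, a common convention.
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)) * temperature**2)

# Toy logits for a batch of 2 examples over 3 classes.
teacher = np.array([[4.0, 1.0, 0.2], [0.1, 3.5, 0.4]])
student = np.array([[3.0, 1.5, 0.5], [0.3, 2.8, 0.6]])

loss = distillation_loss(teacher, student)
```

The student is trained to minimise this loss; when its logits exactly match the teacher’s, the loss is zero. Distilling through an API works the same way in principle, except the “teacher” signal is the larger model’s generated outputs rather than its raw logits.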
Vin Vashishta, founder of V Squared, also questioned whether DeepSeek’s decision to open-source its LLM meant it had, in effect, returned the data OpenAI allegedly used to train its models to its original owners.
“A company that made its name regurgitating and recombining sliced-up bits of intellectual property in statistically probable ways (sometimes verging on plagiarism) without due compensation is now … whining about …. another company apparently doing the same, at lower cost,” said AI critic Gary Marcus in a post on X.
This comes in the wake of the death, reported as a suicide, of OpenAI whistleblower Suchir Balaji, who accused the company of unethical practices. In August 2024, Balaji resigned from OpenAI, citing concerns over the company’s business practices, and publicly raised ethical concerns about its operations, particularly regarding copyright.
In an October 2024 interview with The New York Times, Balaji alleged that OpenAI had violated US copyright law while developing ChatGPT. OpenAI is treading a very complicated path here. The startup has openly acknowledged using publicly available internet data to train its models.
According to their official documentation: “OpenAI’s foundation models, including the models that power ChatGPT, are developed using three primary sources of information: (1) information that is publicly available on the internet, (2) information that we partner with third parties to access, and (3) information that our users or human trainers and researchers provide or generate.”
It would now be hypocritical of OpenAI to accuse DeepSeek of using OpenAI’s models. For instance, Mira Murati, former chief technology officer of OpenAI, found herself at the centre of controversy last year over the training data used for Sora, OpenAI’s new text-to-video AI model.
During an interview with The Wall Street Journal, Murati was asked about the specific sources of data used to train Sora. She revealed that the model was trained on “publicly available and licensed data”.
However, when asked whether content from platforms like YouTube, Instagram, or Facebook was used to train the model, she responded with uncertainty, saying, “I’m actually not sure about that. I’m not confident about it.”
OpenAI is currently involved in an ongoing lawsuit with The New York Times and other publishers, who sued OpenAI and Microsoft in late 2023, accusing them of copyright infringement. The lawsuit claims that OpenAI trained ChatGPT using millions of the publication’s articles without obtaining permission.
In India, the company is facing a significant copyright infringement lawsuit filed by Asian News International (ANI) in the Delhi High Court in November 2024. ANI’s lawsuit alleges that OpenAI used its published content without permission to train ChatGPT.
The case has attracted significant attention from other media organisations in India, including those owned by prominent business figures Gautam Adani and Mukesh Ambani, who have joined the legal proceedings against OpenAI.
OpenAI is Not Alone
Meta is facing a class-action lawsuit filed by authors like Richard Kadrey, Sarah Silverman, and Ta-Nehisi Coates, accusing the company of using copyrighted material without permission to train its Llama models. The lawsuit, Kadrey v. Meta, is being heard in the US District Court for the Northern District of California.
Internal documents suggest Meta used pirated content from the controversial site LibGen to train its models, allegedly with CEO Mark Zuckerberg’s approval, despite legal concerns.
On the other hand, Google says that its foundational language models are trained primarily on publicly available, crawlable data from the internet. The company gives publishers control over how their sites are used with Google-Extended, a tool that web publishers can use to manage whether their sites help improve Gemini Apps and Vertex AI generative APIs.