A Proof News investigation has found that some of the world’s largest tech companies, including Apple and Nvidia, are using YouTube video transcripts to train their AI systems without the creators’ permission.
The report includes a search tool for checking whether a YouTube channel appears in the dataset and states that “173,536 YouTube video captions from over 48,000 channels were used by major Silicon Valley companies, including Anthropic, Nvidia, Apple, and Salesforce.” Channels in the dataset range from late-night shows such as “The Late Show With Stephen Colbert” and “Jimmy Kimmel Live” to popular YouTube personalities such as MrBeast, tech commentator Marques Brownlee, and PewDiePie.
According to Proof News, the dataset is part of a compilation called the Pile, created by the nonprofit EleutherAI, which explained in a 2020 research paper that the Pile contains 22 separate datasets.
Apple, Anthropic, Nvidia and EleutherAI did not immediately respond to requests for comment.
In an email to CNET, a Google spokesperson said the company stands by its previous statements on the matter and linked to a Bloomberg article from April. In the article, YouTube CEO Neal Mohan said he didn’t know whether OpenAI had actually used YouTube videos to train a text-to-video generator, but that doing so would violate the platform’s terms of service. He did not say whether Google itself had used videos in this way.
AI remains a key technology pursued by tech giants like Apple, Google, Microsoft, Meta, and IBM, but for the technology to advance, vast amounts of data must be ingested into AI models. Leaders in the field, including OpenAI, acknowledge that it is becoming increasingly difficult to find datasets to train AI systems. That’s why OpenAI, the creator of ChatGPT, is negotiating deals with content companies like News Corp. and Reddit to acquire content to feed its AI systems.
But the report suggests that tech companies like Apple and Nvidia may be gobbling up datasets whose contents are, at least in spirit, out of line with what content creators expect from platforms like YouTube, which ostensibly ban the data mining of their videos and video transcripts.
A spokesperson for the public-interest AI startup Anthropic told Proof News that the company uses the Pile to train its AI assistant, Claude, and that “Pile contains a small portion of YouTube captions.”
Anthropic spokeswoman Jennifer Martinez said: “YouTube’s terms cover direct use of its platform, which is separate from use of The Pile dataset. Any concerns about potential violations of YouTube’s terms of service should be directed to the authors of The Pile.”
As the report points out, Google itself has been accused of mining YouTube content: The company told The New York Times that its agreements with content creators allow it to use YouTube content for AI training.