Update: July 18, 2024, 4:44 PM EDT Salesforce provided a comment to Mashable in response to Wired’s report.
A new report claims that tech giants including Apple, Nvidia, Anthropic, and Salesforce used data from “thousands of YouTube videos” to train their AI. An investigation conducted by Proof News and published in Wired found that captions from 173,000 YouTube videos were used in the companies’ AI models.
The dataset, called “YouTube Subtitles,” includes transcripts of videos from educational channels like Khan Academy, MIT, and Harvard, as well as The Wall Street Journal, NPR, and the BBC. Material was also found from YouTube stars like PewDiePie, Marques Brownlee, and MrBeast.
Anthropic has been contacted for comment but has not yet responded. Apple and Salesforce, however, have issued responses to Wired’s report.
Will Apple use this data for Apple Intelligence or other AI services?
The short answer is no, but here’s a longer answer for those of you who aren’t into TLDR:
In an email to Mashable, Apple said that its open source language model, OpenELM, does indeed use the dataset, but not in the way that some might think.
The OpenELM project is part of Apple’s ongoing efforts to benefit the broader research community. In other words, according to Apple, the OpenELM model was created for research purposes only and does not form the basis of Apple’s machine learning-powered hardware or AI services, including Apple Intelligence.
For the uninitiated, Apple Intelligence is the company’s new suite of AI capabilities that was announced at WWDC 2024, the annual event where the company unveils its upcoming software products, including iOS and iPadOS.
For example, Apple Intelligence summarizes text, like emails and text messages, to help you communicate more quickly with friends, loved ones and colleagues. It also powers entertainment-focused features like Genmoji, which generates new iOS emoji with prompts, and Image Playground, which lets users create AI-generated images on the fly.
New Genmoji feature coming to iOS 18. Credit: Apple
When it comes to consumer AI utilities, Apple highlighted that it offers websites the option to opt out of having their content used for AI training. The company also said that its generative models are built and fine-tuned using high-quality data, including content licensed from publishers and stock image companies, as well as data publicly available on the web.
Simply put, Apple isn’t denying that its open source language model, OpenELM, used the dataset, but it wants to be clear that OpenELM won’t serve as the foundation for any of its AI services, including Apple Intelligence.
Salesforce claims academic use
Salesforce also shared its perspective in an email to Mashable.
“The Pile dataset referenced in the research paper was used to train AI models for academic research purposes in 2021,” a Salesforce representative said. “The dataset is publicly available and was released under a permissive license.”
What does Nvidia say?
We reached out to Nvidia for comment, but the company, known for incorporating AI into much of its gaming hardware and many of its services, declined to provide a statement.
We’ll update this post if we hear anything from Anthropic.