Close Menu
  • Home
  • AI
  • Business
  • Market
    • Media
      • News
    • Politics
  • Sports
  • USA
  • World
    • Local
  • Breaking News
  • Health
  • Entertainment & Lifestyle

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated

What's Hot

Startups Weekly: Tech and the law

Gavin Newsom Files $787 Million Defamation Lawsuit Against Fox News

Trump reacts to Supreme Court curbing birthright citizenship injunctions

Facebook X (Twitter) Instagram
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA
  • Privacy Policy
  • Terms & Conditions
Facebook X (Twitter) Instagram
BLMS Media | Breaking News, Politics, Markets & World Updates
  • Home
  • AI
  • Business
  • Market
    • Media
      • News
    • Politics
  • Sports
  • USA
  • World
    • Local
  • Breaking News
  • Health
  • Entertainment & Lifestyle
BLMS Media | Breaking News, Politics, Markets & World Updates
Home » EleutherAI releases massive AI training dataset of licensed and open domain text
AI

EleutherAI releases massive AI training dataset of licensed and open domain text

BLMS MEDIABy BLMS MEDIAJune 6, 2025No Comments3 Mins Read
Share Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Copy Link
Follow Us
Google News Flipboard
Share
Facebook Twitter LinkedIn Pinterest Email Copy Link


EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called The Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes in size, The Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, that EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web — including copyrighted material like books and research journals — to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts, and it draws on sources, including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that the Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the company released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire — and legal pressure — for using The Pile to train models.

EleutherAI is committing to releasing open datasets more frequently going forward in collaboration with its research and infrastructure partners.



Source link

Follow on Google News Follow on Flipboard
Share. Facebook Twitter Pinterest LinkedIn Tumblr Email Copy Link
Previous ArticleMusk could lose billions of dollars in spat with Trump
Next Article ‘I Don’t Agree With It’
BLMS MEDIA
  • Website

Related Posts

Startups Weekly: Tech and the law

June 27, 2025

Big Tech lands an early win in legal battles against publishers

June 27, 2025

TechCrunch All Stage 2025: Prepare 4 VC’s Jason Kraus will instruct on how to turn chaos into momentum

June 27, 2025
Add A Comment
Leave A Reply Cancel Reply

Top Posts

Nova Scotia: Siblings Lily, 6, and Jack, 4, have been missing in rural Canada for four days

May 6, 202515 Views

Families of Air India crash victims give DNA samples to help identify loved ones

June 13, 20258 Views

Australia’s center-left Labor Party retains power as conservative leader loses seat, networks report

May 3, 20254 Views

These kibbutzniks used to believe in peace with Palestinians. Their views now echo Israel’s rightward shift

May 2, 20254 Views
Don't Miss

Startups Weekly: Tech and the law

By BLMS MEDIAJune 27, 20250

Welcome to Startups Weekly — your weekly recap of everything you can’t miss from the…

Big Tech lands an early win in legal battles against publishers

TechCrunch All Stage 2025: Prepare 4 VC’s Jason Kraus will instruct on how to turn chaos into momentum

TC All Stage brings back early launch prices for a limited time

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated

Our Picks

Startups Weekly: Tech and the law

Gavin Newsom Files $787 Million Defamation Lawsuit Against Fox News

Trump reacts to Supreme Court curbing birthright citizenship injunctions

Welcome to BLMS Media — your trusted source for news, insights, and stories that shape our world.

At BLMS Media, we are committed to delivering timely, accurate, and in-depth information across a wide range of topics. Whether you’re looking for breaking news, political analysis, market trends, or global developments, we bring you the stories that matter — with clarity, integrity, and perspective.

Facebook X (Twitter) Instagram Pinterest
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA
  • Privacy Policy
  • Terms & Conditions
© 2025 blmsmedia. Designed by blmsmedia.

Type above and press Enter to search. Press Esc to cancel.