Here's Proof You Can Train AI Models Without Abusing Copyrighted Material


In 2023, OpenAI told the UK Parliament that it was “impossible” to train leading AI models without using copyrighted material. It's a popular stance in the AI world, where OpenAI and other major players have scraped copyrighted content from the web to train models that power chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement.

Two announcements on Wednesday provide evidence that large language models can in fact be trained without using copyrighted material without permission.

A group of researchers backed by the French government has released the largest AI training dataset composed entirely of text in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification to a large language model built without copyright infringement, showing that technology like the one behind ChatGPT can be built in a different way than the AI industry's contentious norm.

“There is no fundamental reason why someone couldn’t do LLM training properly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after leaving his executive role at image-generation startup Stability AI because he disagreed with its policy of scraping content without permission.

Fairly Trained offers certification to companies that want to prove they have trained their AI models on data that they own, have licensed, or that is in the public domain. When the nonprofit launched, some critics pointed out that it had not yet identified a large language model that met those requirements.

Today, Fairly Trained announced that it has certified its first large language model. It's called KL3M and was developed by Chicago-based legal tech consulting startup 273 Ventures using a curated training dataset of legal, financial, and regulatory documents.

Jillian Bommarito, the company's co-founder, says the decision to train KL3M this way stemmed from the company's “risk-averse” clients, such as law firms. “They are concerned about provenance, and they need to know that the output is not based on tainted data,” she says. “We are not relying on fair use.” Those clients were interested in using generative AI for tasks like summarizing legal documents and drafting contracts, but didn't want to get dragged into intellectual property lawsuits like those facing OpenAI, Stability AI, and others.

Bommarito says 273 Ventures had not worked on any large language models before, but decided to train one as an experiment. “Our test is to see if this is even possible,” she says. The company created its own training dataset, the Kelvin Legal DataPack, which contains thousands of legal documents reviewed for compliance with copyright law.

Although the dataset is small (about 350 billion tokens, or units of data) compared to those compiled by OpenAI and others that have extensively scraped the internet, Bommarito says the KL3M model performed much better than expected, something she credits to how carefully the data was vetted beforehand. “Clean, high-quality data can mean you don't need to build the model so big,” she says. Carefully curating a dataset can also help make a finished AI model specialized for the task it is designed for. 273 Ventures is now offering spots on a waiting list to clients who want to purchase access to this data.

Clean Sheet

Companies wishing to emulate KL3M may find more help in the future in the form of freely available, infringement-free datasets. On Wednesday, researchers released what they claim is the largest available AI dataset for language models composed entirely of public domain material. The Common Corpus, as it's called, is a collection of texts of roughly the same size as the data used to train OpenAI's GPT-3 text generation model, and it was posted on the open source AI platform Hugging Face.

The dataset was created from sources such as the US Library of Congress and public domain newspapers digitized by the National Library of France. Pierre-Carl Langlais, project coordinator of the Common Corpus, calls it “a corpus large enough to train cutting-edge LLMs.” In AI parlance, the dataset contains 500 billion tokens; OpenAI's most capable models are believed to have been trained on several trillion.