It turns out you can train AI models without copyrighted material

AI researchers have demonstrated that large language models can be trained using only public domain and openly licensed materials, challenging the assertion that copyrighted data is essential for AI development. They assembled an 8 TB dataset from publicly available resources, including books from the Library of Congress, and used it to train a seven-billion-parameter model that performed comparably to existing models such as Meta's Llama 2-7B. However, the manual annotation and legal complexities involved in sourcing the data limited how powerful the model could be. The study points to a more ethical path for AI development, though one that remains difficult to implement in practice.
