Meta released Llama 2 in July 2023, with its models trained on 2 trillion tokens. Its successor, Llama 3, released in April 2024, was trained on over 15 trillion tokens. AI models require a large corpus of high-quality training data to perform well, and that data can be obtained by purchasing datasets, scraping the internet, or generating synthetic data. Internet scraping has led to a string of legal battles (such as The New York Times v. OpenAI and Microsoft, and Alden Global Capital v. OpenAI and Microsoft) over copyrighted content being collected without payment and used in commercial models. Businesses that do not block data scraping risk losing control of their data. IT leaders and content strategists can protect their data by blocking web crawlers …
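A common first step is a robots.txt file that disallows known AI crawlers, optionally backed by a server-side User-Agent check for crawlers that ignore robots.txt. The sketch below is a minimal Python illustration of both ideas, not something described in this article: the crawler names (GPTBot, CCBot, Google-Extended, ClaudeBot) are published tokens, but the list is illustrative and should be verified against each vendor's current documentation before relying on it.

```python
# Minimal sketch: generate robots.txt rules for known AI crawlers and
# check request User-Agent strings against the same list. The list below
# is illustrative, not exhaustive.

AI_CRAWLER_TOKENS = [
    "GPTBot",           # OpenAI's web crawler
    "CCBot",            # Common Crawl
    "Google-Extended",  # robots.txt product token for Google AI training
                        # (not sent as a request User-Agent)
    "ClaudeBot",        # Anthropic's crawler
]


def robots_txt(disallowed=AI_CRAWLER_TOKENS) -> str:
    """Build a robots.txt body that disallows the listed crawlers site-wide."""
    rules = [f"User-agent: {token}\nDisallow: /" for token in disallowed]
    rules.append("User-agent: *\nAllow: /")  # everyone else is unaffected
    return "\n\n".join(rules)


def is_blocked(user_agent_header: str) -> bool:
    """Return True if a request's User-Agent matches a blocked crawler."""
    ua = user_agent_header.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


if __name__ == "__main__":
    print(robots_txt())
    print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"))  # True
```

robots.txt is advisory, so the User-Agent check (or an equivalent rule at the CDN or firewall) is what actually enforces the policy against crawlers that do not honor it.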