The Internet's Historical Archives Are In Danger, And AI Is To Blame
The arrival of Artificial Intelligence (AI) chatbots has changed our lives in many ways. Nowadays, if you want information about something, you can ask an AI chatbot and have an answer in seconds. Some people also use AI to get work done, although experts caution against trusting chatbots too heavily. Yet while AI has made parts of our lives easier, it is also wreaking havoc in other ways. For example, AI is making some everyday items more expensive due to the ever-increasing demand for memory and storage in the data centers that power the technology.
AI is also putting the Internet Archive, the web's go-to digital historical archive, in jeopardy. Founded in 1996 as a non-profit organization, the Internet Archive is by far the world's largest digital library, created with a mission to preserve the web and provide universal access to information. That makes it an invaluable resource for finding past versions of web content, and even for recovering pages that have since been deleted from their original source.
To do its preservation work, the Internet Archive uses crawlers to capture snapshots of web pages and makes that content searchable through the Wayback Machine. Now, because of AI, the Internet Archive is facing perhaps its biggest challenge yet, one that could leave it a far less complete record of the web. According to an investigation by Nieman Lab, some websites have begun blocking the Internet Archive's crawlers, viewing them as a backdoor through which AI companies scrape their content without permission.
Dozens of websites have blocked the Internet Archive's crawlers
Nieman Lab reports that several publishers have restricted the Internet Archive's crawlers, including the Financial Times, The New York Times, The Athletic, and The Guardian. In total, the investigation found that 241 news sites across nine countries, including the U.S., had blocked at least one of the Internet Archive's bots by adding it to the disallow list in their robots.txt file, the file that tells bots which parts of a site they may crawl and which they must avoid. While the underlying concern about AI companies using the Internet Archive as a backdoor is valid, such restrictions put the organization's mission of democratizing information at risk.
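To illustrate the mechanism, here is a minimal sketch of how a robots.txt disallow rule shuts out an archiving crawler while leaving other bots untouched, checked with Python's standard-library robots.txt parser. The rules shown are illustrative, not any real publisher's file; "archive.org_bot" is the user-agent name the Internet Archive's crawler identifies itself with.

```python
# Sketch: a robots.txt that disallows the Internet Archive's crawler
# site-wide while still allowing every other bot. Parsed with Python's
# stdlib urllib.robotparser to check who may fetch what.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The archiving bot is refused everywhere on the (hypothetical) site...
print(parser.can_fetch("archive.org_bot", "https://example.com/article"))  # False
# ...while an ordinary crawler is still welcome.
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Note that robots.txt is purely advisory: it only keeps out crawlers that choose to honor it, which well-behaved archiving bots do.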
"If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record," Internet Archive founder Brewster Kahle told Nieman Lab. In support of this mission, The Guardian has only filtered its article pages from the crawlers, but other sections can still be scraped. However, some sites like The New York Times and The Athletic have taken a firm stance by "hard blocking" the crawlers and adding the Internet Archive's bot to the disallow list of their respective websites' robots.txt files.
Reddit also restricted the Internet Archive's access in August 2025, showing that news publishers aren't the only ones blocking archiving crawlers. Nor is the Internet Archive the only target: another non-profit preservation project, Common Crawl, has been hit as well, with 240 of the 241 sites in the investigation disallowing its bots. If this trend continues, you might one day be unable to see deleted Reddit posts, tweets on X, or even articles on news sites.