How an AI startup plans to scan and dispose of millions of books
Court documents reveal how companies raced to acquire more books to feed chatbots: among other things, they bought, scanned, and destroyed millions of copies
At the beginning of 2024, the leaders of Anthropic, an AI development startup, took on an ambitious project while trying to keep it secret. “Project Panama is our attempt to conduct destructive scanning of all the books in the world,” stated the internal plan, unsealed in court documents last week. “We do not want the fact of our activities to become known.”
According to the documents, about a year later, a budget of tens of millions of dollars was allocated for this goal. This money was spent to acquire books and cut off their spines, then scan the pages and inject more knowledge into the AI models underlying products like the popular chatbot Claude.
Details of Project Panama had not been previously reported. The facts surfaced in over 4,000 pages of court documents in a copyright infringement case filed by authors against Anthropic, a company valued at $183 billion. In August, Anthropic agreed to pay $1.5 billion to settle the dispute. But when a district judge decided[1] last week to unseal a whole batch of documents from the case, Anthropic's zeal in acquiring books became apparent.
The new documents (and earlier materials from other lawsuits by authors against AI companies) show the extremes to which tech firms like Anthropic, Meta[2], Google, and OpenAI went to obtain massive datasets to train their software.
The case against Anthropic is part of a wave of lawsuits filed by authors, artists, photographers, and news organizations against AI companies. As the court materials show, tech giants are frantically and sometimes secretly participating in a race to acquire humanity's intellectual heritage.
According to the court documents, companies treat books as a key prize. In an Anthropic document from January 2023, one of the co-founders suggested that training AI models on books could teach them to "write well" rather than imitate "low-quality internet speech." An internal Meta[2] message from 2024 called access to a digital collection of books a "mandatory" condition for remaining competitive with rivals.
However, the materials indicate that companies did not consider it practically feasible to obtain direct permission from publishers and authors to use their works. Instead, the documents claim, Anthropic, Meta[2], and others found ways to acquire books in bulk without notifying the authors. There are also mentions of downloading pirated copies.
In several cases, Meta[2] employees expressed concerns in internal messages that downloading collections of millions of books without the necessary permissions violated copyright. According to the book authors' lawsuit against the company, internal correspondence from December 2023 mentioned that the practice was approved after "escalation to MZ," likely a reference to CEO Mark Zuckerberg. Meta[2] declined to comment for this publication.
In one of the recently released documents, Anthropic reported that company co-founder Ben Mann personally downloaded fiction and non-fiction from LibGen, a shadow library of books and other copyright-infringing content, over 11 days in June 2021. The case includes a screenshot of his browser showing him downloading files with file-sharing programs.
In July 2022, Mann enthusiastically commented on the launch of a new site called Pirate Library Mirror. The site claimed to have an enormous database of books and stated: "We consciously violate copyright in most countries." Mann sent his Anthropic colleagues a link to the site with the note: "perfect timing!!!"
In court, Anthropic stated that the company has never trained a commercial and revenue-generating artificial intelligence model on LibGen data and has never used the Pirate Library Mirror to train a full AI model.
Ed Newton-Rex is a composer and former top executive in the AI field, now the head of a nonprofit advocating for content creators' rights. According to Newton-Rex, the published documents clearly show that AI companies owe authors much more than they have paid so far. "We urgently need a reboot of the entire AI industry so that content creators start receiving fair compensation for their vital contributions," he said.
Google, Microsoft, and OpenAI, the maker of ChatGPT, are also facing copyright lawsuits from writers making similar accusations. (Disclosure: The Washington Post has a content partnership with OpenAI.)
Most lawsuits against AI companies are still ongoing. According to James Grimmelmann, a professor of digital and information law at Cornell Tech, the legal questions raised remain unresolved. However, in two earlier rulings, judges found that using books to train AI models without the permission of the author or publisher may be legal under the copyright doctrine known as "fair use".
In June, District Judge William Alsup ruled that Anthropic had the right to use books to train artificial intelligence models because the models process the material in a "transformative" way. He compared the AI training process to how teachers "teach students to write well." In the same month, District Judge Vince Chhabria ruled in the Meta[2] case that the authors had not demonstrated that the company's AI models could harm sales of their books.
However, the method of obtaining books can still create problems for companies. In the case of Anthropic, the court deemed the book scanning project acceptable. Nevertheless, the judge ruled that the company may have violated copyright when it downloaded millions of pirated books without payment (even before launching Project Panama).
Alsup granted the case class-action status for authors whose books were in two shadow libraries (huge unauthorized online collections of digitized books) that Anthropic downloaded and saved for later use. Without going to trial, the company agreed to pay $1.5 billion to publishers and authors without admitting guilt. Authors whose books were downloaded can claim their share of the payout, estimated at about $3,000 per work.
“The case is settled, but the key court ruling from June 2025 remains in effect,” wrote Anthropic's Deputy General Counsel Aparna Sridhar in a comment to the Washington Post. “Judge Alsup ruled that AI training is 'in its essence transformative': Anthropic's AI models were trained on works not to 'replicate or displace them, but to move away from the originals and create something different.' The settled issue was how certain materials were obtained, not whether we could use them to develop AI models.”
Buy, cut, scan, and recycle
When the Project Panama initiative to purchase and scan physical books was just starting, Anthropic reached out to a Silicon Valley veteran. The company hired Tom Turvey, a leader at Google, who two decades earlier had helped create the famous but legally contentious Google Books project.
According to the case materials, Anthropic initially considered purchasing books from libraries or used bookstores. For example, the company wanted to buy books from the Strand, a famous New York store known for its slogan about 18 miles of new and used books[3]. According to a document describing a March 2024 meeting on content acquisition, the store was “interested in providing used books.”
Anthropic employees also discussed approaching libraries in the U.S., including the New York Public Library[4] or, as stated in the documents, “a new library that is chronically underfunded.”
It is unclear which of these proposals, if any, Anthropic implemented. In response to an email inquiry, a representative from Strand stated that ultimately no books were sold to the company. The New York Public Library did not respond to a request for comment.
Ultimately, Anthropic acquired millions of books, often in batches of tens of thousands, according to the case materials. Key suppliers included book retailers such as the used-book seller Better World Books and the British company World of Books.
The court documents do not disclose the total number of scanned books and their cost. However, in a project proposal from one contractor that eventually worked with Anthropic, it was noted that the AI company “is seeking an experienced document scanning service provider to convert between 500,000 and 2 million books over a six-month period.”
Better World Books and World of Books did not respond to requests for comment on Monday.
The document outlines what the scanning company will do. A “hydraulic cutting machine” will “carefully cut” the books; then the pages “will be scanned on high-speed, high-quality production scanners.” And finally, the document states, the contractor “will coordinate the disposal of the spent books with a waste management company.”
“Somehow not right”
As the documents from the copyright proceedings against Meta[2] show, the social media giant was eager for new data and was willing to take legal risks to obtain it. The judge in the case, Vince Chhabria, sided with Meta[2] regarding the use of books for training AI models. However, he also allowed authors to continue pursuing claims that Meta[2] illegally distributed copies of pirated books. In the Northern District of California, the plaintiffs are seeking class-action status for these claims.
In the lawsuit, the authors claim that Meta[2] management considered paying for books to train AI models but instead decided to download millions of books for free from torrent platforms used for online piracy. Such platforms often reward users who upload content by letting them download large sets of files more quickly.
Internal documents (some of which have already been reported on) show that Meta[2] employees worried that their actions were risky or wrong and discussed how to cover their tracks.
“Downloading torrents on a work laptop feels wrong,” wrote one engineer in 2023, according to the documents. Later, this same employee shared with the company’s lawyers the concern that using torrent sites could mean distributing pirated works to others, and that this “could be illegal.”
A December 2023 message in the court documents clearly shows that the use of LibGen was approved, apparently by Zuckerberg, who was identified by his initials. “After a previous escalation to MZ, the GenAI department was allowed to use LibGen for Llama 3 […] with a number of agreed-upon risk mitigation measures,” the message stated, after which it listed the legal and political risks of using the data.
“If publications come out in the media stating that we used a dataset and knew it was pirated (LibGen, for example), this could undermine our negotiating position with regulators on such matters,” the message continued.
As internal correspondence showed, by April 2024 the company was moving towards downloading LibGen and other shadow libraries. Chat logs show one employee asking another to clarify why servers rented from Amazon were being used for torrent downloads instead of those owned by Facebook[2]. The response: “To reduce the risk that the activity could be traced” back to the company.
In a document filed last month, Meta's[2] lawyers wrote that the company “denies distributing the plaintiffs' works when it downloaded training data […] using torrents.”
In another lawsuit, filed in 2023, book authors accused OpenAI and Microsoft of violating copyright while hunting for books to train AI. OpenAI, where Mann and Anthropic CEO Dario Amodei worked before founding the startup, acknowledged downloading from LibGen but told the court that it deleted the files before the release of ChatGPT.
“OpenAI started this trend, which ultimately led to rampant piracy among companies in the artificial intelligence sector and predatory extraction of all human creativity,” stated Justin A. Nelson, a lawyer from Susman Godfrey LLP, representing the authors of books both in cases against OpenAI and in cases against Anthropic. OpenAI declined to comment for this publication.
Earlier this month, two major publishers asked the court to allow them to join a group of writers and illustrators in a copyright lawsuit against Google, which was originally filed in 2023.
Grimmelmann, the Cornell Tech professor, argues that AI companies have "convinced themselves with faulty logic" regarding the use of copyrighted data. The breakthroughs underlying ChatGPT and similar tools began in academic research, where the use of copyrighted materials for training is generally considered acceptable, he noted. However, according to Grimmelmann, researchers kept up this practice even after they began profiting from AI models.
"By the time tensions around copyright flared up, they were already invested in embedding copyrighted data into their pipelines and found themselves caught in a fast-paced, high-stakes race to release ever newer and more sophisticated models," Grimmelmann stated.
He added that Anthropic's decision to start acquiring and scanning physical books instead of downloading shadow libraries "ultimately turned out to be a smart move." "This is a good example of how a company chose a more measured approach and achieved legal compliance," he said.
Translator's Notes
1. The Washington Post article is written in English, which does not mark grammatical gender; the decision was made by Judge Araceli Martinez-Olguin.
2. The transnational holding Meta, which owns the social network Facebook, has been designated an extremist organization in Russia, where its activities are prohibited.
3. The "18 miles of books" slogan of the Strand Bookstore is ubiquitous; it appears not only on signs and in promotional materials but also on the store's branded clothing and bags. In reality, the total length of the bookshelves has long exceeded those 18 miles (29 km), yet the more accurate "over 23 miles" is not used.
4. The text says a lot about a New York bookstore and the city's largest public library system. Although the article may give the opposite impression, Anthropic's headquarters is in San Francisco, on the West Coast of the USA, not on the East Coast. The company does have a small branch in New York (from 930 to 1,860 m² of office space), although last month there were talks about opening a huge office with an area of at least 23,000 m².