OpenAI suffered a major setback in ongoing copyright litigation when a federal judge ruled the artificial intelligence company must hand over internal communications about its deletion of massive datasets containing pirated books.
The decision, issued by U.S. Magistrate Judge Ona Wang in the Southern District of New York, threatens to expose OpenAI to billions of dollars in potential damages and represents a significant victory for authors and publishers suing the company.
The case centers on two datasets called "Books1" and "Books2" that OpenAI allegedly used to train its ChatGPT language model.
These datasets reportedly contained books from Library Genesis, a pirate library offering free access to copyrighted works.
According to court filings, OpenAI deleted both datasets in 2022, with the company's lawyers stating they were removed "due to their non-use."
Judge rules OpenAI waived attorney-client privilege
In her ruling, Judge Wang determined that OpenAI waived its attorney-client privilege claims by selectively disclosing reasons for deleting the datasets.
"OpenAI has waived privilege by making a moving target of its privilege assertions," Wang wrote in the decision. "OpenAI has gone back and forth on whether 'non-use' as a 'reason' for the deletion of Books1 and Books2 is privileged at all. OpenAI cannot state a 'reason' (which implies it is not privileged) and then later assert that the 'reason' is privileged to avoid discovery."
The ruling grants authors and publishers access to employee communications, including Slack messages, about the dataset deletion.
OpenAI's in-house lawyers will also face depositions about their reasons for removing the data. OpenAI appealed the decision shortly after it was issued, and a separate request for communications among OpenAI's lawyers remains pending.
Billions at stake as willful infringement looms
The implications of Judge Wang's decision extend far beyond standard copyright disputes.
If the disclosed communications reveal that OpenAI knowingly infringed copyrighted material, the company could be found liable for willful infringement.
Under U.S. copyright law, willful infringement carries statutory damages of up to $150,000 per work, compared with a range of $750 to $30,000 per work for ordinary infringement. With tens of millions of books and articles potentially at issue, the financial exposure could reach into the billions.
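To see how quickly per-work statutory damages add up, the arithmetic can be sketched in a few lines of Python. This is an illustration only: the per-work figures are the statutory floor and ceilings under 17 U.S.C. § 504(c), while the work count of 10,000 is a hypothetical chosen for the example, not a number from the court filings. Actual awards are set per work by the court or jury and may fall anywhere within the range.

```python
# Per-work statutory damages under 17 U.S.C. § 504(c), in dollars.
MIN_PER_WORK = 750               # floor for ordinary infringement
MAX_ORDINARY_PER_WORK = 30_000   # ceiling for ordinary infringement
MAX_WILLFUL_PER_WORK = 150_000   # ceiling when infringement is willful


def exposure(num_works: int, per_work: int) -> int:
    """Total damages if every work receives the same per-work award."""
    return num_works * per_work


def works_to_reach(target: int, per_work: int) -> int:
    """Smallest number of works whose combined awards reach the target."""
    return -(-target // per_work)  # integer ceiling division


if __name__ == "__main__":
    billion = 1_000_000_000
    hypothetical_works = 10_000  # assumption for illustration only

    print("Works needed to reach $1 billion:")
    for label, rate in [
        ("statutory minimum", MIN_PER_WORK),
        ("ordinary maximum", MAX_ORDINARY_PER_WORK),
        ("willful maximum", MAX_WILLFUL_PER_WORK),
    ]:
        print(f"  at the {label} (${rate:,}): {works_to_reach(billion, rate):,}")

    total = exposure(hypothetical_works, MAX_WILLFUL_PER_WORK)
    print(f"{hypothetical_works:,} works at the willful maximum: ${total:,}")
```

At the willful ceiling, fewer than 7,000 works would be enough to cross the $1 billion mark, which is why a finding of willfulness matters so much when the disputed datasets may contain far more titles than that.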
David Schultz, a professor at Hamline University, emphasized the significance of accessing attorney communications.
"Finding out what attorneys said or what clients said to attorneys and back and forth probably gives us a lot of evidence regarding state of mind," Schultz said. He added that the disclosure would be an "enormous" blow to OpenAI's defense.
The case echoes Anthropic's settlement with authors in August 2025.
Anthropic, another AI company, agreed to pay $1.5 billion to settle a class action lawsuit after authors accused it of training its Claude language model on pirated books from the same Library Genesis source.
According to court documents, Anthropic cited "inordinate pressure" to avoid a trial that could have resulted in up to $1 trillion in damages.
The consolidated lawsuit involves major publishers and authors
The lawsuit against OpenAI represents a consolidation of multiple copyright cases filed in different federal courts.
The Authors Guild, a professional organization for writers, filed the first case in September 2023, joined by 17 prominent authors including George R.R. Martin, John Grisham, Jonathan Franzen, Jodi Picoult, and Elin Hilderbrand.
The New York Times filed a separate lawsuit in December 2023, accusing OpenAI of using its articles to train chatbots without permission.
In October 2025, U.S. District Judge Sidney Stein, who oversees the consolidated cases, ruled that authors could proceed with claims that ChatGPT's output summaries infringe their copyrights.
Judge Stein found that summaries of works like Martin's "Game of Thrones" series were potentially similar enough to the original books to constitute copyright infringement.
The discovery battle over OpenAI's internal communications represents just one aspect of extensive pretrial proceedings.
Authors and publishers have already gained access to some employee Slack messages discussing the dataset deletion, but they continue pushing for broader disclosure of attorney communications and decision-making processes.