Who owns the words ChatGPT writes? That’s a question the New York Times, Chicago Tribune, John Grisham, Stacy Schiff, Kai Bird, and numerous other authors and newspapers want answered, and they’re willing to go to court to get it.
We’re now in the phase of innovation where tech collides with the real world, and the court system is called upon to decide how existing laws and regulations apply. In the meantime, what does that mean for corporations trying to use ChatGPT and similar chatbots?
Here’s what you need to know about the potential copyright concerns of generative AI and how to mitigate the legal risk to your enterprise.
What Are the Lawsuits Against OpenAI Alleging?
Let’s start by looking at how ChatGPT works. In a nutshell, generative AI models are trained on colossal datasets to learn patterns and establish relationships in text data. Developers feed them diverse, rich data sources, which enhances ChatGPT’s ability to perform common-sense reasoning and predict text. Using a filtered version of Common Crawl, a repository that amounts to a copy of much of the internet, OpenAI scaled up its large language model (LLM) and triggered the generative AI boom.
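To make that training objective concrete, here is a deliberately tiny sketch of next-token prediction using bigram counts. It is an illustration only; GPT models use transformer neural networks over vastly larger corpora, but the core idea of learning statistical patterns from text in order to predict what comes next is the same.

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count which word tends to follow each word
# in a training corpus, then predict the most likely successor.
corpus = "the cat sat on the mat and the cat slept".split()

successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequently observed token after `word`."""
    counts = successors[word]
    return counts.most_common(1)[0][0] if counts else "<unknown>"

print(predict_next("the"))  # -> "cat" (seen twice after "the")
```

Scale that basic pattern-learning idea up to billions of documents and a far more expressive model, and you get a system whose outputs inevitably reflect the material it was trained on, which is exactly what the plaintiffs are focused on.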
The New York Times was one of the 15 most used domains in that training data by volume, which caught the attention of The Gray Lady even though its content makes up a relatively small portion of the model. Upon further investigation, the news organization concluded that the commercial success of the AI startup was built on infringing the intellectual property of authors and publishers.
This and other lawsuits allege that ChatGPT summarizes the unlicensed works it was trained on and, at worst, repeats some sentences verbatim. OpenAI claims those instances of exact copying were elicited by The New York Times exploiting a bug in ways “normal people do not use OpenAI’s product” and that it will address the bug.
The various lawsuits ask for undisclosed monetary damages and, in some cases, for OpenAI and Microsoft (which is heavily invested in the startup) to destroy GPT models trained on unauthorized and unlicensed materials. It’s no overstatement to say the outcome could make or break generative AI.
Could OpenAI Argue Training Data Falls Under Fair Use?
There’s a long history of copyright protection in the United States, starting with the Constitution and reaffirmed by the Copyright Act of 1790. However, revisions to the original law have expanded the fair use doctrine, which allows some limited use of a copyright holder’s work without their permission. OpenAI argues its use of copyrighted materials falls under “longstanding fair use principles”:
“Copyright is not a veto right over transformative technologies that leverage existing works internally—i.e., without disseminating them—to new and useful ends, thereby furthering copyright’s basic purpose without undercutting authors’ ability to sell their works in the marketplace.”
That’s a bold claim, but does it reflect the actual law? Here’s what the U.S. Copyright Office has to say about fair use:
“Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports. There are no legal rules permitting the use of a specific number of words, a certain number of musical notes, or percentage of a work.”
When OpenAI started as a non-profit organization, the scholarly-reports angle might have applied, but its transition to a capped-profit structure around the time of ChatGPT’s launch could undercut that argument. Schools, libraries, and non-profit institutions have an easier time making that case, while internet service providers are shielded from infringement suits by the safe harbor provisions of the Digital Millennium Copyright Act (DMCA).
A clearer predictor of whether OpenAI can successfully claim fair use is the four-factor test the Copyright Office cites, which considers:
- The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
- The nature of the copyrighted work
- The amount and substantiality of the portion used in relation to the copyrighted work as a whole
- The effect of the use upon the potential market for or value of the copyrighted work
Though the courts will make the final decision, the fair use argument might struggle on that last factor, especially as AI overviews creep into search engine results. Whether the plaintiffs can prove, or OpenAI can disprove, an effect of ChatGPT on the potential market for newspaper subscriptions may well determine the fate of this innovation; only time will tell.
How Can You Mitigate Your Risk with ChatGPT?
So, should your organization put a pause on generative AI? Definitely not, but it’s important to protect your organization from litigation down the road.
If you intend to copyright any text or images, make sure you are not using purely AI-generated content. The U.S. Copyright Office ruled that the author of a comic book called Zarya of the Dawn could copyright her story, but not the Midjourney-generated images contained throughout. Generative AI can be used in the creative process, but it cannot be the sole artist: the U.S. Copyright Office has already stated that works created by non-humans cannot be copyrighted (you can thank this photogenic crested macaque). Instead, the following activities are much less risky from a content generation standpoint:
- Brainstorming
- Prototyping
- Proofing Content
- Creating Internal Presentations
Be careful with any source code created by generative AI as well. Another lawsuit, against GitHub, OpenAI, and Microsoft, alleges that the developer platform violated the permission and attribution requirements of open-source licenses when it used copyrighted code to train its Codex and Copilot AI coding tools. So, taking source code “word-for-word” from generative AI tools might put your applications at risk, but you can still use ChatGPT to debug code you wrote yourself, as in the sketch below.
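For instance, here is a minimal sketch of asking ChatGPT to explain a bug through the OpenAI Python SDK, rather than having it author code wholesale. It assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; the model name and snippet are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A function you wrote yourself, with a suspected bug to investigate.
buggy_snippet = '''
def average(values):
    return sum(values) / len(values)  # crashes on an empty list
'''

# Ask the model to review and explain, not to write code for you.
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; use whichever model your plan allows
    messages=[
        {"role": "system", "content": "You are a careful code reviewer."},
        {"role": "user",
         "content": f"Explain the bug in this function and suggest a fix:\n{buggy_snippet}"},
    ],
)
print(response.choices[0].message.content)
```

Used this way, the model critiques code you already own rather than generating code of uncertain provenance that ends up shipping in your product.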
Alternatively, some organizations might be able to reduce the risk of generative AI by training the technology on their own internal data. However, it’s important for developers to tread with caution and seek proper guidance to adhere to security and privacy best practices. Your organization might be at risk of legal action if your LLM is trained on or reveals personally identifiable information (PII) of employees or team members.
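If you go that route, scrub obvious PII from documents before they enter your training corpus. Below is a minimal sketch assuming regex-based redaction of emails and US-style phone numbers is acceptable for your data; a production pipeline should pair this with review by security, privacy, and legal experts.

```python
import re

# Patterns for two common PII types; real pipelines need broader coverage
# (names, addresses, IDs) and human review of edge cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

record = "Contact Jane Doe at jane.doe@example.com or 555-867-5309."
print(redact_pii(record))
# -> "Contact Jane Doe at [EMAIL] or [PHONE]."
```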
In all instances, it’s important to draw on the expertise of both artificial intelligence specialists and legal counsel. By combining their perspectives, you can reduce the risk to your organization and verify that your IP is actually yours.
If you’re looking to stay competitive with artificial intelligence, we have you covered. Watch our webinar, “Decoding AI: Ethics, Efficiency, and the Candidate Experience.”