New Bill Requires AI Companies like OpenAI to Disclose Training Data

Artificial intelligence companies may have to become a lot more transparent about how they train their models, if a new bill from Rep. Adam Schiff passes in Congress. Schiff has proposed the Generative AI Copyright Disclosure Act, which would require firms like OpenAI to list the copyrighted works they use to build generative-AI systems. The bill comes amid a growing outcry over the burgeoning industry's use of copyrighted materials to train its large language models, and it’s the latest in a number of Congressional pushes to regulate the technology and protect human content creators.

“AI has the disruptive potential of changing our economy, our political system, and our day-to-day lives,” Schiff said in a statement. “We must balance the immense potential of AI with the crucial need for ethical guidelines and protections. . . . This is about respecting creativity in the age of AI and marrying technological progress with fairness.”

The bill faces a potential uphill battle in Congress, as there has been plenty of gridlock when it comes to AI legislation. Some opponents worry that regulation would slow down the technology’s pace of expansion, potentially giving countries like Russia and China an advantage. Should it pass, though, here’s what you need to know about it.

What would the Generative AI Copyright Disclosure Act require AI companies to do?

Schiff’s bill would require companies to let the government know before they launch an AI system. They’ll also be required to list “all copyrighted works used in building or altering the training dataset for that system.”

Is this bill just for new AI systems?

No. The bill’s rules would be retroactive, requiring the makers of generative-AI systems already on the market, like OpenAI’s ChatGPT, to disclose where they got the information they used to train their models. That’s something companies have been reluctant to discuss in general, particularly amid lawsuits from companies like the New York Times. OpenAI CTO Mira Murati recently raised eyebrows when she claimed she was unsure whether the company’s Sora tool used data from YouTube, Facebook, or Instagram posts.

How far in advance would AI companies have to comply?

The bill mandates that the list of training data be submitted at least 30 days before the AI system is available to the public. Any substantial changes to the training dataset post-launch would also need to be reported.

What sort of penalties would AI companies face for noncompliance?

That’s unclear. The Copyright Office would determine how much the companies would be fined, and the amounts would depend on the company’s size and whether it has a history of ignoring the Act. Penalties would start at $5,000 and go up from there. The Act does not cap the maximum penalty that can be assessed.

Would this prevent AI companies from using copyrighted work?

Not directly, but it could bring some accountability to the table. With a list of the copyrighted works used for training, copyright holders could check whether they gave permission for the use of their content and whether they were compensated for that usage.

Who is backing the Generative AI Copyright Disclosure Act?

Schiff’s legislative allies haven’t lined up yet, but in the creative community, several big names are supporting the act. The Recording Industry Association of America has offered its support, as have the Directors Guild of America, SAG-AFTRA, ASCAP, and many more creative unions. (The support comes after Billie Eilish and 200 music artists signed an open letter critical of AI and calling for an end to the use of AI in music creation.)

“This bill is an important first step in addressing the unprecedented and unauthorized use of copyrighted materials to train generative-AI systems,” said Meredith Stiehm, president of the Writers Guild of America West. “Greater transparency and guardrails around AI are necessary to protect . . . creators.”

Source: Chris Morris, Fast Company
Published: 2024-04-10T19:00:00