AI’s Fingerprints Were All Over the Election

The Atlantic

The images and videos were hard to miss in the days leading up to November 5. There was Donald Trump with the chiseled musculature of Superman, hovering over a row of skyscrapers. Trump and Kamala Harris squaring off in bright-red uniforms (McDonald’s logo for Trump, hammer-and-sickle insignia for Harris). People had clearly used AI to create these—an effort to show support for their candidate or to troll their opponents. But the images didn’t stop after Trump won. The day after polls closed, the Statue of Liberty wept into her hands as a drizzle fell around her. Trump and Elon Musk, in space suits, stood on the surface of Mars; hours later, Trump appeared at the door of the White House, waving goodbye to Harris as she walked away, clutching a cardboard box filled with flags.

Every federal election since at least 2018 has been plagued with fears about potential disruptions from AI. Perhaps a computer-generated recording of Joe Biden would swing a key county, or doctored footage of a poll worker burning ballots would ignite riots. Those predictions never materialized, but many of them were also made before the arrival of ChatGPT, DALL-E, and the broader category of advanced, cheap, and easy-to-use generative-AI models—all of which seemed much more threatening than anything that had come before. Not even a year after ChatGPT was released in late 2022, generative-AI programs were used to target Trump, Emmanuel Macron, Biden, and other political leaders. In May 2023, an AI-generated image of smoke billowing out of the Pentagon caused a brief dip in the U.S. stock market. Weeks later, Ron DeSantis’s presidential primary campaign appeared to have used the technology to make an advertisement.

And so a trio of political scientists at Purdue University decided to get a head start on tracking how generative AI might influence the 2024 election cycle. In June 2023, Christina Walker, Daniel Schiff, and Kaylyn Jackson Schiff started to track political AI-generated images and videos in the United States. Their work is focused on two particular categories: deepfakes, referring to media made with AI, and “cheapfakes,” which are produced with more traditional editing software, such as Photoshop. Now, more than a week after polls closed, their database, along with the work of other researchers, paints a surprising picture of how AI appears to have actually influenced the election—one that is far more complicated than previous fears suggested.

The most visible generated media this election have not exactly planted convincing false narratives or otherwise deceived American citizens. Instead, AI-generated media have been used for transparent propaganda, satire, and emotional outpourings: Trump, wading in a lake, clutches a duck and a cat (“Protect our ducks and kittens in Ohio!”); Harris, enrobed in a coppery blue, struts before the Statue of Liberty and raises a matching torch. In August, Trump posted an AI-generated video of himself and Musk doing a synchronized TikTok dance; a follower responded with an AI image of the duo riding a dragon. The pictures were fake, sure, but they weren’t feigning otherwise. In their analysis of election-week AI imagery, the Purdue team found that such posts were far more frequently intended for satire or entertainment than false information per se. Trump and Musk have shared political AI illustrations that got hundreds of millions of views. Brendan Nyhan, a political scientist at Dartmouth who studies the effects of misinformation, told me that the AI images he saw “were obviously AI-generated, and they were not being treated as literal truth or evidence of something. They were treated as visual illustrations of some larger point.” And this usage isn’t new: In the Purdue team’s entire database of fabricated political imagery, which includes hundreds of entries, satire and entertainment were the two most common goals.

That doesn’t mean these images and videos are merely playful or innocuous. Outrageous and false propaganda, after all, has long been an effective way to spread political messaging and rile up supporters. Some of history’s most effective propaganda campaigns have been built on images that simply project the strength of one leader or nation. Generative AI offers a low-cost and easy tool to produce huge amounts of tailored images that accomplish just this, heightening existing emotions and channeling them to specific ends.

These sorts of AI-generated cartoons and agitprop could well have swayed undecided minds, driven turnout, galvanized “Stop the Steal” plotting, or driven harassment of election officials or racial minorities. An illustration of Trump in an orange jumpsuit emphasizes Trump’s criminal convictions and perceived unfitness for the office, while an image of Harris speaking to a sea of red flags, a giant hammer-and-sickle above the crowd, smears her as “woke” and a “Communist.” An edited image showing Harris dressed as Princess Leia kneeling before a voting machine and captioned “Help me, Dominion. You’re my only hope” (an altered version of a famous Star Wars line) stirs up conspiracy theories about election fraud. “Even though we’re noticing many deepfakes that seem silly, or just seem like simple political cartoons or memes, they might still have a big impact on what we think about politics,” Kaylyn Jackson Schiff told me. It’s easy to imagine someone’s thought process: That image of “Comrade Kamala” is AI-generated, sure, but she’s still a Communist. That video of people shredding ballots is animated, but they’re still shredding ballots. That’s a cartoon of Trump clutching a cat, but immigrants really are eating pets. Viewers, especially those already predisposed to find and believe extreme or inflammatory content, may be further radicalized and siloed. The especially photorealistic propaganda might even fool someone if reshared enough times, Walker told me.

There were, of course, also a number of fake images and videos that were intended to directly change people’s attitudes and behaviors. The FBI has identified several fake videos intended to cast doubt on election procedures, such as false footage of someone ripping up ballots in Pennsylvania. “Our foreign adversaries were clearly using AI” to push false stories, Lawrence Norden, the vice president of the Elections & Government Program at the Brennan Center for Justice, told me. He did not see any “super innovative use of AI,” but said the technology has augmented existing strategies, such as creating fake-news websites, stories, and social-media accounts, as well as helping plan and execute cyberattacks. But it will take months or years to fully parse the technology’s direct influence on 2024’s elections. Misinformation in local races is much harder to track, for example, because there is less of a spotlight on them. Deepfakes in encrypted group chats are also difficult to track, Norden said. Experts had also wondered whether the use of AI to create highly realistic, yet fake, videos showing voter fraud might have been deployed to discredit a Trump loss. This scenario has not yet been tested.

Although it appears that AI did not directly sway the results last week, the technology has eroded Americans’ overall ability to know or trust information and one another—not deceiving people into believing a particular thing so much as advancing a nationwide descent into believing nothing at all. A new analysis by the Institute for Strategic Dialogue of AI-generated media during the U.S. election cycle found that users on X, YouTube, and Reddit inaccurately assessed whether content was real roughly half the time, and more frequently thought authentic content was AI-generated than the other way around. With so much uncertainty, using AI to convince people of alternative facts seems like a waste of time—far more useful to exploit the technology to directly and forcefully send a motivated message, instead. Perhaps that’s why, of the election-week, AI-generated media the Purdue team analyzed, pro-Trump and anti-Kamala content was most common.

More than a week after Trump’s victory, the use of AI for satire, entertainment, and activism has not ceased. Musk, who will soon co-lead a new extragovernmental organization, routinely shares such content. The morning of November 6, Donald Trump Jr. put out a call for memes that was met with all manner of AI-generated images. Generative AI is changing the nature of evidence, yes, but also that of communication—providing a new, powerful medium through which to illustrate charged emotions and beliefs, broadcast them, and rally even more like-minded people. Instead of an all-caps thread, you can share a detailed and personalized visual effigy. These AI-generated images and videos are instantly legible and, by explicitly targeting emotions instead of information, obviate the need for falsification or critical thinking at all. No need to refute, or even consider, a differing view—just make an angry meme about it. No need to convince anyone of your adoration of J. D. Vance—just use AI to make him, literally, more attractive. Veracity is beside the point, which makes the technology perhaps the nation’s most salient mode of political expression. In a country where facts have gone from irrelevant to detestable, of course deepfakes—fake news made by deep-learning algorithms—don’t matter; to growing numbers of people, everything is fake but what they already know, or rather, feel.

Search the Hollywood AI Database

Editor’s note: This search tool is part of The Atlantic’s investigation into the OpenSubtitles data set.

The Hollywood AI Database

Editor’s note: This analysis is part of The Atlantic’s investigation into the OpenSubtitles data set.

For as long as generative-AI chatbots have been on the internet, Hollywood writers have wondered if their work has been used to train them. The chatbots are remarkably fluent with movie references, and companies seem to be training them on all available sources. One screenwriter recently told me he’s seen generative AI reproduce close imitations of The Godfather and the 1980s TV show Alf, but he had no way to prove that a program had been trained on such material.

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. Not just on The Godfather and Alf, but on more than 53,000 other movies and 85,000 other TV episodes: Dialogue from all of it is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies. I recently downloaded this data set, which I saw referenced in papers about the development of various large language models (or LLMs). It includes writing from every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. It even includes prewritten “live” dialogue from Golden Globes and Academy Awards broadcasts. If a chatbot can mimic a crime-show mobster or a sitcom alien—or, more pressingly, if it can piece together whole shows that might otherwise require a room of writers—data like this are part of the reason why.

The files within this data set are not scripts, exactly. Rather, they are subtitles taken from a website called OpenSubtitles.org. Users of the site typically extract subtitles from DVDs, Blu-ray discs, and internet streams using optical-character-recognition (OCR) software. Then they upload the results to OpenSubtitles.org, which now hosts more than 9 million subtitle files in more than 100 languages and dialects. Though this may seem like a strange source for AI-training data, subtitles are valuable because they’re a raw form of written dialogue. They contain the rhythms and styles of spoken conversation and allow tech companies to expand generative AI’s repertoire beyond academic texts, journalism, and novels, all of which have also been used to train these programs. Well-written speech is a rare commodity in the world of AI-training data, and it may be especially valuable for training chatbots to “speak” naturally.

According to research papers, the subtitles have been used by Anthropic to train its ChatGPT competitor, Claude; by Meta to train a family of LLMs called Open Pre-trained Transformer (OPT); by Apple to train a family of LLMs that can run on iPhones; and by Nvidia to train a family of NeMo Megatron LLMs. The data set has also been used by Salesforce, Bloomberg, EleutherAI, Databricks, Cerebras, and various other AI developers to build at least 140 open-source models distributed on the AI-development hub Hugging Face. Many of these models could potentially be used to compete with human writers, and they’re built without permission from those writers.

When I reached out to Anthropic for this article, the company did not provide a comment on the record. When I’ve previously spoken with Anthropic about its use of this data set, a spokesperson told me the company had “trained our generative-AI assistant Claude on the public dataset The Pile,” of which OpenSubtitles is a part, and “which is commonly used in the industry.” A Salesforce spokesperson told me that although the company has used OpenSubtitles in generative-AI development, the data set “was never used to inform or enhance any of Salesforce’s product offerings.” Apple similarly told me that its small LLM was intended only for research. However, both Salesforce and Apple, like other AI developers, have made their models available for developers to use in any number of different contexts. All other companies mentioned in this article—Nvidia, Bloomberg, EleutherAI, Databricks, and Cerebras—either declined to comment or did not respond to requests for comment.

Two years after the release of ChatGPT, it may not be surprising that creative work is used without permission to power AI products. Yet the notion remains disturbing to many artists and professionals who feel that their craft and livelihoods are threatened by programs. Transparency is generally low: Tech companies tend not to advertise whose work they use to train their products. The legality of training on copyrighted work also remains an open question. Numerous lawsuits have been brought against tech companies by writers, actors, artists, and publishers alleging that their copyrights have been violated in the AI-training process: As Breaking Bad’s creator, Vince Gilligan, wrote to the U.S. Copyright Office last year, generative AI amounts to “an extraordinarily complex and energy-intensive form of plagiarism.” Tech companies have argued that training AI systems on copyrighted work is “fair use,” but a court has yet to rule on this claim. In the language of copyright law, subtitles are likely considered derivative works, and a court would generally see them as protected by the same rules against copying and distribution as the movies they’re taken from.

The OpenSubtitles data set has circulated among AI developers since 2020. It is part of the Pile, a collection of data sets for training generative AI. The Pile also includes text from books, patent applications, online discussions, philosophical papers, YouTube-video subtitles, and more. It’s an easy way for companies to start building AI systems without having to find and download the many gigabytes of high-quality text that LLMs require.

OpenSubtitles can be downloaded by anyone who knows where to look, but as with most AI-training data sets, it’s not easy to understand what’s in it. It’s a 14-gigabyte text file with short lines of unattributed dialogue, meaning the speaker is not identified. There’s no way to tell where one movie ends and the next begins, let alone what the movies are. I downloaded a “raw” version of the data set, in which the movies and episodes were separated into 446,612 files and stored in folders whose names corresponded to the ID numbers of movies and episodes listed on IMDb.com. Most folders contained multiple subtitle versions of the same movie or TV show (different releases may be tweaked in various ways), but I was able to identify at least 139,000 unique movies and episodes. I downloaded metadata associated with each title from the OpenSubtitles.org website (allowing me to map actors and directors to each title, for instance) and used it to build the search tool.
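
For readers curious how a count like that might be made, here is a minimal sketch in Python. It is not the method used for this analysis. It assumes a hypothetical layout in which every subtitle file sits in a folder named for its title's ID number, and the function names `count_subtitle_files` and `summarize` are invented for illustration.

```python
from collections import Counter
from pathlib import Path


def count_subtitle_files(root):
    """Map each ID-named folder under root to the number of files it holds.

    Assumption: the folder that directly contains a file is named for the
    title's ID number, so one folder corresponds to one movie or episode.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.parent.name] += 1
    return counts


def summarize(counts):
    """Collapse per-folder counts into the totals a reader would care about."""
    return {
        # One folder per title, so the number of folders is the number of titles.
        "unique_titles": len(counts),
        # Every subtitle version counts toward the file total.
        "total_files": sum(counts.values()),
        # Titles that were uploaded in more than one subtitle version.
        "titles_with_multiple_versions": sum(1 for n in counts.values() if n > 1),
    }
```

Calling `summarize(count_subtitle_files("path/to/raw"))` on such a folder tree would return the three totals as a dictionary. The gap between the file total and the title total reflects how many duplicate subtitle versions the collection holds.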

The OpenSubtitles data set adds yet another wrinkle to a complex narrative around AI, in which consent from artists and even the basic premise of the technology are points of contention. Until very recently, no writer putting pen to paper on a script would have thought their creative work might be used to train programs that could replace them. And the subtitles themselves were not originally intended for this purpose, either. The multilingual OpenSubtitles data set contains subtitles in 62 different languages and 1,782 language-pair combinations: It is meant for training the models behind apps such as Google Translate and DeepL, which can be used to translate websites, street signs in a foreign country, or an entire novel. Jörg Tiedemann, one of the data set’s creators, wrote in an email that he was happy to see OpenSubtitles being used in LLM development, too, even though that was not his original intention.

He is, in any case, powerless to stop it. The subtitles are on the internet, and there’s no telling how many independent generative-AI programs they’ve been used for, or how much synthetic writing those programs have produced. But now, at least, we know a bit more about who is caught in the machinery. What will the world decide they are owed?