The Hollywood AI Database

The Atlantic

www.theatlantic.com › technology › archive › 2024 › 11 › opensubtitles-ai-data-set › 680650

Editor’s note: This analysis is part of The Atlantic’s investigation into the OpenSubtitles data set.

For as long as generative-AI chatbots have been on the internet, Hollywood writers have wondered if their work has been used to train them. The chatbots are remarkably fluent with movie references, and companies seem to be training them on all available sources. One screenwriter recently told me he’s seen generative AI reproduce close imitations of The Godfather and the 1980s TV show Alf, but he had no way to prove that a program had been trained on such material.

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. Not just on The Godfather and Alf, but on more than 53,000 other movies and 85,000 other TV episodes: Dialogue from all of it is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies. I recently downloaded this data set, which I saw referenced in papers about the development of various large language models (or LLMs). It includes writing from every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. It even includes prewritten “live” dialogue from Golden Globes and Academy Awards broadcasts. If a chatbot can mimic a crime-show mobster or a sitcom alien—or, more pressingly, if it can piece together whole shows that might otherwise require a room of writers—data like this are part of the reason why.

The files within this data set are not scripts, exactly. Rather, they are subtitles taken from a website called OpenSubtitles.org. Users of the site typically extract subtitles from DVDs, Blu-ray discs, and internet streams using optical-character-recognition (OCR) software. Then they upload the results to OpenSubtitles.org, which now hosts more than 9 million subtitle files in more than 100 languages and dialects. Though this may seem like a strange source for AI-training data, subtitles are valuable because they’re a raw form of written dialogue. They contain the rhythms and styles of spoken conversation and allow tech companies to expand generative AI’s repertoire beyond academic texts, journalism, and novels, all of which have also been used to train these programs. Well-written speech is a rare commodity in the world of AI-training data, and it may be especially valuable for training chatbots to “speak” naturally.

According to research papers, the subtitles have been used by Anthropic to train its ChatGPT competitor, Claude; by Meta to train a family of LLMs called Open Pre-trained Transformer (OPT); by Apple to train a family of LLMs that can run on iPhones; and by Nvidia to train a family of NeMo Megatron LLMs. They have also been used by Salesforce, Bloomberg, EleutherAI, Databricks, Cerebras, and various other AI developers to build at least 140 open-source models distributed on the AI-development hub Hugging Face. Many of these models could potentially be used to compete with human writers, and they’re built without permission from those writers.

When I reached out to Anthropic for this article, the company did not provide a comment on the record. When I’ve previously spoken with Anthropic about its use of this data set, a spokesperson told me the company had “trained our generative-AI assistant Claude on the public dataset The Pile,” of which OpenSubtitles is a part, and “which is commonly used in the industry.” A Salesforce spokesperson told me that although the company has used OpenSubtitles in generative-AI development, the data set “was never used to inform or enhance any of Salesforce’s product offerings.” Apple similarly told me that its small LLM was intended only for research. However, both Salesforce and Apple, like other AI developers, have made their models available for developers to use in any number of different contexts. All other companies mentioned in this article—Nvidia, Bloomberg, EleutherAI, Databricks, and Cerebras—either declined to comment or did not respond to requests for comment.

You may search through the data set using the tool below.

Two years after the release of ChatGPT, it may not be surprising that creative work is used without permission to power AI products. Yet the notion remains disturbing to many artists and professionals who feel that their craft and livelihoods are threatened by programs. Transparency is generally low: Tech companies tend not to advertise whose work they use to train their products. The legality of training on copyrighted work also remains an open question. Numerous lawsuits have been brought against tech companies by writers, actors, artists, and publishers alleging that their copyrights have been violated in the AI-training process: As Breaking Bad’s creator, Vince Gilligan, wrote to the U.S. Copyright Office last year, generative AI amounts to “an extraordinarily complex and energy-intensive form of plagiarism.” Tech companies have argued that training AI systems on copyrighted work is “fair use,” but a court has yet to rule on this claim. In the language of copyright law, subtitles are likely considered derivative works, and a court would generally see them as protected by the same rules against copying and distribution as the movies they’re taken from.

The OpenSubtitles data set has circulated among AI developers since 2020. It is part of the Pile, a collection of data sets for training generative AI. The Pile also includes text from books, patent applications, online discussions, philosophical papers, YouTube-video subtitles, and more. It’s an easy way for companies to start building AI systems without having to find and download the many gigabytes of high-quality text that LLMs require.

OpenSubtitles can be downloaded by anyone who knows where to look, but as with most AI-training data sets, it’s not easy to understand what’s in it. It’s a 14-gigabyte text file with short lines of unattributed dialogue—meaning the speaker is not identified. There’s no way to tell where one movie ends and the next begins, let alone what the movies are. I downloaded a “raw” version of the data set, in which the movies and episodes were separated into 446,612 files and stored in folders whose names corresponded to the ID numbers of movies and episodes listed on IMDb.com. Most folders contained multiple subtitle versions of the same movie or TV show (different releases may be tweaked in various ways), but I was able to identify at least 139,000 unique movies and episodes. I downloaded metadata associated with each title from the OpenSubtitles.org website—allowing me to map actors and directors to each title, for instance—and used it to build the tool above.
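For readers curious how a tally like this might be made, here is a minimal Python sketch. It is an illustration only, not the method used for this analysis: it assumes that each subtitle file sits directly inside a folder named after its title's IMDb ID number, and the function name `count_subtitle_versions` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_subtitle_versions(root: Path) -> Counter:
    """Map each ID-named folder to the number of subtitle files inside it.

    Assumption (not confirmed by the article): every subtitle file sits
    directly inside a folder named after its title's IMDb ID number,
    at any depth under `root`.
    """
    versions = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Treat the immediate parent folder's name as the title ID.
            versions[path.parent.name] += 1
    return versions
```

Under that assumption, `len(versions)` would give the number of distinct ID-named folders and `sum(versions.values())` the total number of files. Those two numbers are only rough analogues of the 139,000 unique titles and 446,612 files reported above, since identifying unique movies and episodes may have involved further steps that the article does not describe.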

The OpenSubtitles data set adds yet another wrinkle to a complex narrative around AI, in which consent from artists and even the basic premise of the technology are points of contention. Until very recently, no writer putting pen to paper on a script would have thought their creative work might be used to train programs that could replace them. And the subtitles themselves were not originally intended for this purpose, either. The multilingual OpenSubtitles data set contained subtitles in 62 different languages and 1,782 language-pair combinations: It is meant for training the models behind apps such as Google Translate and DeepL, which can be used to translate websites, street signs in a foreign country, or an entire novel. Jörg Tiedemann, one of the data set’s creators, wrote in an email that he was happy to see OpenSubtitles being used in LLM development, too, even though that was not his original intention.

He is, in any case, powerless to stop it. The subtitles are on the internet, and there’s no telling how many independent generative-AI programs they’ve been used for, or how much synthetic writing those programs have produced. But now, at least, we know a bit more about who is caught in the machinery. What will the world decide they are owed?

Richard Price’s Radical, Retrograde Novel

The Atlantic

www.theatlantic.com › magazine › archive › 2024 › 12 › lazarus-man-richard-price-book-review › 680397

In his tenth novel, Lazarus Man, Richard Price is, to borrow one of his own lines, on a “hunt for moments”—snapshots in time, chance encounters, fleeting interactions that reveal someone or something in a startling new light. “I’ve got like X-ray eyes for the little gestures that go right by everybody,” he explained in a profile timed to the publication of his 1992 novel, Clockers. “I don’t go for the big picture so much as a lot of little big pictures.” Mary Roe, a detective and one of the characters in his new book, shares that instinct. At the scene of a “larger horror,” what hits her most forcefully is not the dead bodies but a crushed USPS mail cart, “an everyday object so violently deformed.” It beckons her toward “an unasked-for comprehension of the whole.”

The currency that Lazarus Man—a patchwork of scenes from urban working-class life, set in the spring of 2008—trades in is the micro-epiphany. Price’s four interlaced East Harlem protagonists are big-souled people navigating narrow, “negotiated life.” What they want for themselves—someone to lie beside, a little more money, work that doesn’t involve selling something—rarely outpaces what is possible. They do not ask for much more of the world, or New York City, than it is ready to give. Each one of them is decent.

Felix Pearl is a 20-something photographer with a gig taking video footage of a playground and basketball courts for the parks department (to monitor safety protocols) and a habit of getting bamboozled by a pretty junkie. Mary, a respected member of the police department, is also the daughter of a prizefighter tormented by a mistake he made as a young man. Anthony Carter, a middle-aged divorcé, is a former salesman, former teacher, and former cokehead hoping to stumble onto a metaphysical truth that will mend the broken parts of his life. Royal Davis, a failed NBA hopeful, runs a funeral home and wishes he didn’t. When a tenement building collapses in Harlem, their paths become entangled and they reexamine their lives.

This disaster, which leaves six dead, is ostensibly the big event that sets the novel in motion, but it also feels almost beside the point. No one, including Price, shows much curiosity about what caused the collapse. In fact, Lazarus Man seems deeply uninterested in the idea of cause at all. The characters we encounter live a challenging existence; they are not quite on the cliff’s edge, but they are close enough to peer into the canyon without craning their neck. The novel has all the trappings of fiction as gritty urban social portraiture—the kind of enterprise that Price is associated with as the author of the drug-trade-steeped novel Clockers and a writer for HBO’s The Wire. Yet it isn’t.

Nor has Price written a gentrification novel about a changing Harlem, even as its Harlem is changing. Or a novel of proletarian discontent, though it is about discontented proletarians. Lazarus Man isn’t about structural racism either, despite being populated with minorities down on their luck and harangued by the police. What Price has given us is a retrograde novel. It is animated by unreconstructed, unembarrassed humanism.

His pages offer no fictional repackaging of uplift or pessimism or low-wattage Marxist theory. They depict no working-class heroes or Dickensian scoundrels. The characters are not pawns in some philosophical or political or cultural proxy war for which the novel is simply a vehicle. You would be forgiven for overlooking that the story is set amid the heat of Barack Obama’s historic presidential campaign, because this never comes up. Price’s characters are strapped but not completely stuck, battered by social structures but not paralyzed by them.

In his 1945 lecture “Existentialism Is a Humanism,” Jean-Paul Sartre observed that European existentialism had developed an undeserved reputation for being “gloomy,” denigrated as a philosophical movement obsessed with death, absurdity, anxiety, and the like. Sartre rejected this appraisal: Existentialism turned on the conviction that people can—in the face of history’s sweep, dehumanizing societal institutions, and unrestrained economic and technological development—choose how to live. Speaking before a sizable crowd at Paris’s Club Maintenant, Sartre addressed his critics. “Their excessive protests make me suspect that what is annoying them is not so much our pessimism,” the philosopher wryly observed, “but, much more likely, our optimism. For at bottom, what is alarming in the doctrine that I am about to try to explain to you is—is it not?—that it confronts man with a possibility of choice.”

Lazarus Man’s protagonists, confronted with exceptional circumstances they had no hand in generating, must nonetheless contend with the discomfiting reality of their own agency. This leaves Price walking a tightrope. His novel at once invites and undercuts the polarized attitudes toward social crisis that have recently become familiar—either fatalistic acceptance or righteous denunciation. Lazarus Man is about a traumatic event that defies a reflexive victim-culture response, as well as the lazy buck-up bromides favored by that culture’s critics.

Put a different way, it is a trauma novel without a trauma plot—pushing back against the formulaic storyline, so thoroughly skewered by The New Yorker’s Parul Sehgal, that reduces characters to predictable symptoms after some fateful event. The book’s author, too, isn’t readily fazed: Price, a white novelist writing yet again about Black urban life, betrays no signs of racial anxiety. “Northern white writers sometimes see black people as another species,” he noted in 2006. “I think the white writer sometimes says, ‘No, no, that’s a hornet’s nest.’ ” He’s still poking it.

The possibility that Price might have adopted the identitarian conventions of the previous decade or so—the last novel he wrote under his own name was Lush Life (2008), which unfolds on New York City’s Lower East Side—is swiftly ruled out by Anthony, the novel’s anchoring character. A half-Black, half–Italian Irish Ivy League screwup—years ago, he lost a full ride to Columbia for dealing drugs—he has been on a downward trajectory ever since. In his sober and unemployed middle age, he has been living in his dead parents’ tenement apartment and resists any attempt to frame himself as a victim. “A therapist suggested that as a Black student he might have subconsciously felt pressure to act out the role expected of him by the white students,” he reflects. Then he adds, “But that was bullshit.”

When Anthony is pulled out of the soot-gray rubble a third of the way through the novel—the reborn Lazarus of the book’s title—he is a changed man. Or, more accurately, he is a man desperately trying to play the part of a changed man. It is never quite clear, even as Lazarus Man rushes toward its devastating denouement, whom exactly Anthony is trying to convince of his redemption: the audiences who eventually come to hear him speak at community events, enchanted by the wisdom he has wrung from brute survival, or the man he sees in the mirror. To the extent that Price’s novel has a message, it is that epiphanies are a kind of theater we perform for ourselves. Faced with disaster or a momentous encounter, we are not gripped by revelation or metamorphosed in the fire of circumstance. Events do not transform us against our will. We decide, always retroactively, that some unexpected joy or undeserved blow is the stuff out of which a new life is made.

This idea, that we choose our own epiphanies, appears again and again. Mary, the worn-down detective, is especially epiphany-haunted, surrounded by people who have undergone sudden shifts of self. Her father, disturbed by his capacity for cruelty after a boxing opponent ended up permanently disabled, abruptly gave up the sweet science. Her husband is a reformed violent drunk whom Mary finds boring in his new meditative sobriety. And Mary herself lives in the long shadow of a halfway epiphany, restless in marriage and motherhood after a freak elevator accident two years earlier nearly killed her, leaving her searching for—and failing to find—new moorings. Mary spends much of the novel playing the role of dutiful detective, looking for a resident of the imploded building who hasn’t been seen since the day it crumbled. She tenuously connects the search to absolution for herself—guilt-ridden about being a distant mother—and for her father, convinced that discovering the missing man, dead or alive, will somehow land her on terra firma.

Lazarus Man possesses the same kind of telegenic quality that made Clockers an inspiration for The Wire. Some vignettes read like hilarious set pieces. When the tenement dissolves into a haze of white smog and rubble, Royal is dozing in one of his unsold coffins. Awoken by the noise, he pushes open the lid of his pine box and sits bolt upright, scaring witless the group of film-school students to whom he’s rented out his struggling funeral home so they can shoot a bad zombie movie. This slapstick gives way to something darker as Royal, knowing that the rumbling boom means bodies—and thus business—instructs his son to put on his best black suit and go hawk their services. Other moments give way to a gentle melancholy.

And as Anthony is slowly transformed into a minor New York celebrity—first thanks to a local-news appearance, and then through a series of speeches he is coaxed into giving—his ordeal gels into an earnest if squishy doctrine, one part self-help and one part call to duty. He proclaims again and again that his only goal is to “be of service.” His lectures are full of clichés and pseudo-profundities—“The street can be a brutal sculptor”—but his overwrought aphorisms also land, the kinds of phrases that audience members scribble down and later recite around the dinner table. Anthony’s underlying theme is always that change is possible, that the worst that comes to pass will end up being “the best thing that could possibly happen to you.” Personal catastrophe, Anthony preaches, is a gift. A sheep in wolf’s clothing.

But in the end, Lazarus Man rejects its own Lazarus. Or at least Price subverts his post-traumatic gospel. When a woman approaches Anthony after one of his appearances, interrogating him about his mantra—“Whatever befalls you no matter how heartbreaking or onerous will turn out to be the best thing”—he finds himself, for the first time, at a loss for words. She tells him about a husband of two decades, newly dead. Three young kids at home and an ailing mother. An apartment slipping through her grasp. Plainly, no alchemy is forthcoming: The fragments of her life will not turn to gold if she just hopes hard enough. After Anthony mumbles something about God, she lets him have it.

When the novel at last gives up its final secret—who our Lazarus Man is, really—the big reveal does not hand over any certainty as to what lies in Anthony’s heart. The question that haunts the second half of the book is whether he is a con artist or a genuine street prophet. The answer ends up being neither. Or both. The simple truth is that one bad decision led to worse decisions, then to better ones. The same could be said of each character. As to the question of whether that building collapse truly made a new man, no one (including Anthony) is sure.

The genius of Price’s novel is that it rejects all mechanistic accounts of human existence—tragic or utopian, religious or otherwise—without downplaying the social forces that shape lives of labor. Price isn’t peddling a bootstraps humanism. Anthony, Felix, Royal, and Mary cannot pull themselves up into a more comfortable middle-class existence through sheer will, or by the thaumaturgic power of some hoped-for epiphany. They cannot be exactly who they want to be. But Price holds them accountable for who they are, and the choices they make within the world as it is given to them. Lazarus Man leaves us with four people still lurching toward an uncertain transformation. “I’m thinking a few things,” Royal muses. “All I know for sure is that I have to make a life that I can live with.”

This article appears in the December 2024 print edition with the headline “Richard Price’s Radical, Retrograde Novel.”