The AI War Was Never Just About AI

The Atlantic

www.theatlantic.com/technology/archive/2024/11/google-antitrust-generative-ai/680803

For almost two years now, the world’s biggest tech companies have been at war over generative AI. Meta may be known for social media, Google for search, and Amazon for online shopping, but since the release of ChatGPT, each has made tremendous investments in an attempt to dominate in this new era. Along with start-ups such as OpenAI, Anthropic, and Perplexity, their spending on data centers and chatbots is on track to eclipse the costs of sending the first astronauts to the moon.

To be successful, these companies will have to do more than build the most “intelligent” software: They will need people to use, and return to, their products. Everyone wants to be Facebook, and nobody wants to be Friendster. To that end, the best strategy in tech hasn’t changed: build an ecosystem that users can’t help but live in. Billions of people use Google Search every day, so Google built a generative-AI product known as “AI Overviews” right into the results page, granting it an immediate advantage over competitors.

This is why a recent proposal from the Department of Justice is so significant. The government wants to break up Google’s monopoly over the search market, but its proposed remedies may in fact do more to shape the future of AI. Google owns 15 products that serve at least half a billion people and businesses each—a sprawling ecosystem of gadgets, search and advertising, personal applications, and enterprise software. An AI assistant that shows up in (or works well with) those products will be the one that those people are most likely to use. And Google has already woven its flagship Gemini AI models into Search, Gmail, Maps, Android, Chrome, the Play Store, and YouTube, all of which have at least 2 billion users each. AI doesn’t have to be life-changing to be successful; it just has to be frictionless. The DOJ now has an opportunity to add some resistance. (In a statement last week, Kent Walker, Google’s chief legal officer, called the Department of Justice’s proposed remedy part of an “interventionist agenda that would harm Americans and America’s global technology leadership,” including the company’s “leading role” in AI.)

[Read: The horseshoe theory of Google Search]

Google is not the only competitor with an ecosystem advantage. Apple is integrating its Apple Intelligence suite across eligible iPhones, iPads, and Macs. Meta, with more than 3 billion users across its platforms, including Facebook, Instagram, and WhatsApp, enjoys similar benefits. Amazon’s AI shopping assistant, Rufus, has garnered little attention but nonetheless became available to the website’s U.S. shoppers this fall. However much of the DOJ’s request the court ultimately grants, these giants will still lead the AI race—but Google has had the clearest advantage among them.

Just how good any of these companies’ AI products are has limited relevance to their adoption. Google’s AI tools have repeatedly shown major flaws, such as confidently recommending eating rocks for good health, but the features continue to be used by more and more people simply because they’re there. Similarly, Apple’s AI models are less powerful than Gemini or ChatGPT, but they will have a huge user base simply because of how popular the iPhone is. Meta’s AI models may not be state-of-the-art, but that doesn’t matter to billions of Facebook, Instagram, and WhatsApp users who just want to ask a chatbot a silly question or generate a random illustration. Tech companies without such an ecosystem are well aware of their disadvantage: OpenAI, for instance, is reportedly considering developing its own web browser, and it has partnered with Apple to integrate ChatGPT across the company’s phones, tablets, and computers.

[Read: AI search is turning into the problem everyone worried about]

This is why it’s relevant that the DOJ’s proposed antitrust remedy takes aim at Google’s broader ecosystem. Federal and state attorneys asked the court to force Google to sell off its Chrome browser; cease preferencing its search products in the Android mobile operating system; prevent it from paying other companies, including Apple and Samsung, to make Google the default search engine; and allow rivals to syndicate Google’s search results and use its search index to build their own products. All of these and the DOJ’s other requests, under the auspices of search, are really shots at Google’s expansive empire.

As my colleague Ian Bogost has argued, selling Chrome might not affect Google’s search dominance: “People returned to Google because they wanted to, not just because the company had strong-armed them,” he wrote last week. But selling Chrome and potentially Android, as well as preventing Google from making its search engine the default option for various other companies’ products, would make it harder for Google to funnel billions of people to the rest of its software, including AI. Meanwhile, access to Google’s search index could provide a huge boost to OpenAI, Perplexity, Microsoft, and other AI search competitors: Perhaps the hardest part of building a searchbot is trawling the web for reliable links, and rivals would gain access to the most coveted way of doing so.

[Read: Google already won]

The Justice Department seems to recognize that the AI war implicates and goes beyond search. Without intervention, Google’s search monopoly could give it an unfair advantage over AI as well—and an AI monopoly could further entrench the company’s control over search. The court, attorneys wrote, must prevent Google from “manipulating the development and deployment of new technologies,” most notably AI, to further throttle competition.

And so the order also takes explicit aim at AI. The DOJ wants to bar Google from self-preferencing AI products, in addition to Search, in Chrome, Android, and all of its other products. It wants to stop Google from buying exclusive rights to sources of AI-training data and prohibit it from investing in AI start-ups and competitors that are in or might enter the search market. (Two days after the DOJ released its proposal, Amazon invested another $4 billion into Anthropic, the start-up and OpenAI rival that Google has also heavily backed to this point, suggesting that the e-commerce giant might be trying to lock in an advantage over Google.) The DOJ also requested that Google provide a simple way for publishers to opt out of their content being used to train Google’s AI models or be cited in AI-enhanced search products. All of that will make it harder for Google to train and market future AI models, and easier for its rivals to do the same.

When the DOJ first sued Google, in 2020, it was concerned with the internet of old: a web that appeared intractably stuck, long ago calcified in the image of the company that controls how billions of people access and navigate it. Four years and a historic victory later, its proposed remedy enters an internet undergoing an upheaval that few could have foreseen—but that the DOJ’s lawsuit seems to have nonetheless anticipated. A frequently cited problem with antitrust litigation in tech is anachronism: By the time a social-media, personal-computing, or e-commerce monopoly is apparent, it is already too late to disrupt it. With generative AI, the government may finally have the head start it needs.

The Hollywood AI Database

The Atlantic

www.theatlantic.com/technology/archive/2024/11/opensubtitles-ai-data-set/680650

Editor’s note: This analysis is part of The Atlantic’s investigation into the OpenSubtitles data set.

For as long as generative-AI chatbots have been on the internet, Hollywood writers have wondered if their work has been used to train them. The chatbots are remarkably fluent with movie references, and companies seem to be training them on all available sources. One screenwriter recently told me he’s seen generative AI reproduce close imitations of The Godfather and the 1980s TV show Alf, but he had no way to prove that a program had been trained on such material.

I can now say with absolute confidence that many AI systems have been trained on TV and film writers’ work. Not just on The Godfather and Alf, but on more than 53,000 other movies and 85,000 other TV episodes: Dialogue from all of it is included in an AI-training data set that has been used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and other companies. I recently downloaded this data set, which I saw referenced in papers about the development of various large language models (or LLMs). It includes writing from every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. It even includes prewritten “live” dialogue from Golden Globes and Academy Awards broadcasts. If a chatbot can mimic a crime-show mobster or a sitcom alien—or, more pressingly, if it can piece together whole shows that might otherwise require a room of writers—data like this are part of the reason why.

[Read: These 183,000 books are fueling the biggest fight in publishing and tech]

The files within this data set are not scripts, exactly. Rather, they are subtitles taken from a website called OpenSubtitles.org. Users of the site typically extract subtitles from DVDs, Blu-ray discs, and internet streams using optical-character-recognition (OCR) software. Then they upload the results to OpenSubtitles.org, which now hosts more than 9 million subtitle files in more than 100 languages and dialects. Though this may seem like a strange source for AI-training data, subtitles are valuable because they’re a raw form of written dialogue. They contain the rhythms and styles of spoken conversation and allow tech companies to expand generative AI’s repertoire beyond academic texts, journalism, and novels, all of which have also been used to train these programs. Well-written speech is a rare commodity in the world of AI-training data, and it may be especially valuable for training chatbots to “speak” naturally.

According to research papers, the subtitles have been used by Anthropic to train its ChatGPT competitor, Claude; by Meta to train a family of LLMs called Open Pre-trained Transformer (OPT); by Apple to train a family of LLMs that can run on iPhones; and by Nvidia to train a family of NeMo Megatron LLMs. The data set has also been used by Salesforce, Bloomberg, EleutherAI, Databricks, Cerebras, and various other AI developers to build at least 140 open-source models distributed on the AI-development hub Hugging Face. Many of these models could potentially be used to compete with human writers, and they’re built without permission from those writers.

When I reached out to Anthropic for this article, the company did not provide a comment on the record. When I’ve previously spoken with Anthropic about its use of this data set, a spokesperson told me the company had “trained our generative-AI assistant Claude on the public dataset The Pile,” of which OpenSubtitles is a part, and “which is commonly used in the industry.” A Salesforce spokesperson told me that although the company has used OpenSubtitles in generative-AI development, the data set “was never used to inform or enhance any of Salesforce’s product offerings.” Apple similarly told me that its small LLM was intended only for research. However, both Salesforce and Apple, like other AI developers, have made their models available for developers to use in any number of different contexts. All other companies mentioned in this article—Nvidia, Bloomberg, EleutherAI, Databricks, and Cerebras—either declined to comment or did not respond to requests for comment.

Two years after the release of ChatGPT, it may not be surprising that creative work is used without permission to power AI products. Yet the notion remains disturbing to many artists and professionals who feel that their craft and livelihoods are threatened by programs. Transparency is generally low: Tech companies tend not to advertise whose work they use to train their products. The legality of training on copyrighted work also remains an open question. Numerous lawsuits have been brought against tech companies by writers, actors, artists, and publishers alleging that their copyrights have been violated in the AI-training process: As Breaking Bad’s creator, Vince Gilligan, wrote to the U.S. Copyright Office last year, generative AI amounts to “an extraordinarily complex and energy-intensive form of plagiarism.” Tech companies have argued that training AI systems on copyrighted work is “fair use,” but a court has yet to rule on this claim.

In the language of copyright law, subtitles are likely considered derivative works, and a court would generally see them as protected by the same rules against copying and distribution as the movies they’re taken from.

The OpenSubtitles data set has circulated among AI developers since 2020. It is part of the Pile, a collection of data sets for training generative AI. The Pile also includes text from books, patent applications, online discussions, philosophical papers, YouTube-video subtitles, and more. It’s an easy way for companies to start building AI systems without having to find and download the many gigabytes of high-quality text that LLMs require.

[Read: Generative AI is challenging a 234-year-old law]

OpenSubtitles can be downloaded by anyone who knows where to look, but as with most AI-training data sets, it’s not easy to understand what’s in it. It’s a 14-gigabyte text file with short lines of unattributed dialogue—meaning the speaker is not identified. There’s no way to tell where one movie ends and the next begins, let alone what the movies are. I downloaded a “raw” version of the data set, in which the movies and episodes were separated into 446,612 files and stored in folders whose names corresponded to the ID numbers of movies and episodes listed on IMDb.com. Most folders contained multiple subtitle versions of the same movie or TV show (different releases may be tweaked in various ways), but I was able to identify at least 139,000 unique movies and episodes. I downloaded metadata associated with each title from the OpenSubtitles.org website—allowing me to map actors and directors to each title, for instance—and used it to build The Atlantic’s search tool.
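The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual analysis code: the folder names, file extensions, and example IMDb IDs below are hypothetical stand-ins for the raw data set's layout, in which each top-level folder is named for an IMDb ID and holds one or more subtitle variants of the same title.

```python
import tempfile
from pathlib import Path

def count_unique_titles(root: Path) -> int:
    """Count unique titles in a layout like <root>/<imdb_id>/<variant>.txt.

    Each top-level folder name stands in for an IMDb ID; multiple subtitle
    variants of the same movie or episode live inside one folder, so we
    count folders, not files.
    """
    return sum(
        1 for p in root.iterdir() if p.is_dir() and any(p.glob("*.txt"))
    )

# Build a tiny mock of that layout in a temp directory (hypothetical IDs).
root = Path(tempfile.mkdtemp())
for imdb_id, variants in {"0068646": 3, "0086579": 1}.items():
    folder = root / imdb_id
    folder.mkdir()
    for i in range(variants):
        (folder / f"variant_{i}.txt").write_text("- Example subtitle line.\n")

# Four subtitle files, but only two unique titles.
print(count_unique_titles(root))  # → 2
```

The same folders-not-files distinction is why the article reports roughly 139,000 unique titles from 446,612 files.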

The OpenSubtitles data set adds yet another wrinkle to a complex narrative around AI, in which consent from artists and even the basic premise of the technology are points of contention. Until very recently, no writer putting pen to paper on a script would have thought their creative work might be used to train programs that could replace them. And the subtitles themselves were not originally intended for this purpose, either. The multilingual OpenSubtitles data set contains subtitles in 62 different languages and 1,782 language-pair combinations: It is meant for training the models behind apps such as Google Translate and DeepL, which can be used to translate websites, street signs in a foreign country, or an entire novel. Jörg Tiedemann, one of the data set’s creators, wrote in an email that he was happy to see OpenSubtitles being used in LLM development, too, even though that was not his original intention.

He is, in any case, powerless to stop it. The subtitles are on the internet, and there’s no telling how many independent generative-AI programs they’ve been used for, or how much synthetic writing those programs have produced. But now, at least, we know a bit more about who is caught in the machinery. What will the world decide they are owed?

Warren Buffett thinks pizza is more valuable than iPhones right now

Quartz

qz.com/berkshire-hathaway-buffett-apple-dominos-pizza-stock-1851700073

Warren Buffett’s Berkshire Hathaway (BRK.A) acquired a stake in Domino’s Pizza in the third quarter, even after dumping millions of dollars worth of Apple (AAPL) and Bank of America (BAC) to grow its cash pile.