
AI: The OpenAI vs Google AI 'Data Dance'. RTZ #825
As LLM AI companies continue to race ahead, outdoing each other almost weekly in the capabilities of their Generative AI models, the data needed both to train those models and to run the inference queries of hundreds of millions of users remains an essential, never-ending fuel.
Data, both human-generated and synthetic, is what big and small AI companies are turning to. It’s the key unique Box no. 4 in the AI Tech Stack below, something not seen in previous tech stacks for the PC, Internet, Mobile and many others.
This is especially true in the arena of ‘Physical AI’, where AI-driven robots and cars in particular need extraordinary and accelerating amounts of synthetic data to train the robots of the future, especially the humanoid ones.
But coming back to LLM AIs, one of the primary sources of raw data continues to be humans searching for things on the internet. Especially via the most ubiquitous desktop and smartphone application of them all, web browsers. Thus the focus shifts to Google and Apple, the two companies with the most popular web browsers on the planet, Chrome and Safari. And soon OpenAI’s expected ‘AI Browser’.
Thus the new arena of AI Browsers, a topic I’ve written about and discussed extensively, is another key battleground for new forms of data to feed the ever-growing needs of large and small language models going forward.
And the focus shifts to Google alone in the arena of web search, since Google of course dominates the global internet search market. Except in China, where Baidu and others rule the roost.
Thus it’s no surprise to see The Information highlight the rising competition between OpenAI and Google in “OpenAI Is Challenging Google—While Using Its Search Data”:
“As it tries to unseat Google, OpenAI is relying on search data from an unlikely source: Google.”
This is despite OpenAI’s increasing reliance on Google for the underlying AI data center ‘Compute’ on Google’s TPU chip architecture.
“OpenAI has been using Google search results scraped from the web to help power ChatGPT responses, according to two people with knowledge of it.”
“The Google search data helps answer ChatGPT queries on current events, such as news, sports and equity markets, one of the people said.”
The source, of course, is a third-party industry data provider:
“OpenAI is getting the data from SerpApi, an eight-year-old web-scraping firm, which listed OpenAI as a customer on its website as recently as May last year. It removed the reference for reasons that couldn’t be learned.”
This is despite, of course, the ‘Frenemies’ situation described above:
“At the same time, OpenAI has begun to rent cloud servers from Google Cloud to power ChatGPT, suggesting that Google believes it can still benefit from the rise of OpenAI, in a way similar to how it forged deep business ties with other longtime rivals such as Apple and Meta Platforms.”
Google is not taking this OpenAI activity lying down:
“Even so, Google has shown a sensitivity about allowing OpenAI to access its search data directly. A year ago, Google rejected OpenAI’s request to do so to develop search for ChatGPT, according to testimony and emails in Google’s ongoing antitrust case.”
“Google executives have privately derided SerpApi, which is based in Austin, Texas, and tried various techniques to make it harder for the firm to scrape high-quality information through its web crawler, said a person with knowledge of the efforts. It isn’t clear how successful those efforts have been.”
“Google doesn’t appear to have taken legal steps to try to shut down SerpApi, whose actions might run afoul of Google’s terms of service. Due to regulatory scrutiny, Google may be wary of going after competitors who use its search results. In its ongoing antitrust court battle with the Department of Justice, the judge overseeing the case has signaled support for forcing Google to share its search results data with rivals.”
Both sides try to keep things as civil as possible:
“We retrieve accurate, contextually relevant information from web pages and a variety of providers. This allows us to surface and synthesize information from multiple sources,” an OpenAI spokesperson said in a statement. A Google spokesperson and SerpApi CEO Julien Khaleghy declined to comment.”
“This isn’t the first time OpenAI has used Google data to boost its artificial intelligence products. The ChatGPT maker previously illicitly used data from YouTube videos to train some of its AI models, The Information reported.”
Of course OpenAI is not the only one taking this route:
“OpenAI also isn’t the only Google rival to use SerpApi data. SerpApi’s website previously listed Apple as a customer. In addition to partnering with Google on search, the iPhone maker develops technology to power searches in Safari—a lucrative deal that the judge overseeing the DOJ case could also nix.”
“SerpApi also lists Perplexity, which runs an AI search engine, as a customer. OpenAI in January estimated it handled at least 25 times more web searches per day than Perplexity, according to a government filing.”
And OpenAI is not relying only on Google as a raw data source:
“OpenAI doesn’t rely entirely on Google search results for its search responses. It uses its own web crawler for obtaining and indexing web data, and it has also gotten data from Microsoft’s Bing via an application programming interface, which allows developers to access and use its search results. Other companies including Brave and Exa offer similar search APIs, but Google does not because it considers search data one of its crown jewels.”
But the Google data fingerprints are easily found:
“OpenAI executives themselves have admitted that it would be extremely difficult for them to replicate Google’s level of accuracy on their own when it comes to uncommon search queries.”
“Outside developers have begun to notice Google search results popping up in ChatGPT.”
“Our goal—which was a lofty goal, and we’re nowhere near close—is to serve about 80% of our traffic from our own first-party index,” said Nick Turley, head of product for ChatGPT, during a hearing in the Google case in April. “We think 100% [this] is long-term attainable but so far away and so uncertain that it’s not an operationalizable goal, even for a set of smart people who are ambitious and think they can do the impossible.”
Google also works with other AI ‘frenemies’ like Meta on this front:
“Still, Google has shown it is willing to provide search information to rivals such as Meta Platforms, which uses Google to help answer some users’ questions in the Meta AI chatbot. SerpApi’s website also lists Meta as a customer.”
The relative numbers are also notable:
“Google said in March that it processes more than 5 trillion searches annually. That means it handles dozens of times more web searches per day than ChatGPT, which says 700 million people use the chatbot per week.”
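As a rough back-of-the-envelope check on those numbers, the short Python sketch below converts Google's annual figure into a daily one and shows what different readings of "dozens of times" would imply for ChatGPT. The only input taken from the quote is the 5 trillion annual searches; the multiples of 24, 36 and 48 are purely illustrative assumptions of mine, and the function names are hypothetical.

```python
# Back-of-the-envelope check using only the annual figure from the quoted passage.
GOOGLE_SEARCHES_PER_YEAR = 5_000_000_000_000  # "more than 5 trillion searches annually"
DAYS_PER_YEAR = 365


def google_searches_per_day() -> float:
    """Average daily Google searches implied by the annual figure."""
    return GOOGLE_SEARCHES_PER_YEAR / DAYS_PER_YEAR


def implied_chatgpt_searches_per_day(ratio: float) -> float:
    """Daily ChatGPT searches implied if Google handles `ratio` times as many.

    `ratio` is a hypothetical reading of "dozens of times"; the article
    does not give an exact multiple.
    """
    return google_searches_per_day() / ratio


print(f"Google: about {google_searches_per_day() / 1e9:.1f} billion searches per day")
for ratio in (24, 36, 48):  # illustrative multiples only, not reported figures
    implied = implied_chatgpt_searches_per_day(ratio)
    print(f"At {ratio}x, ChatGPT would handle about {implied / 1e6:.0f} million per day")
```

On the annual figure alone, Google's side works out to roughly 13.7 billion searches a day, so any "dozens of times" multiple would put ChatGPT somewhere in the hundreds of millions of searches per day.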
“The competition doesn’t appear to have dampened Google’s search ad revenue—which grew 11.7% in the June quarter—though executives and shareholders have expressed concern that the rise of ChatGPT will eventually crimp Google’s growth.”
“Less than three years old, ChatGPT is already on pace to generate more than $10 billion in revenue annually from subscriptions alone. OpenAI has discussed making money from free users by selling ads or getting commissions from retailers and other businesses that generate sales from people who find their products and services through ChatGPT.”
And the AI data quest spills over to online ecommerce:
“Google also has information about millions of products available through online shopping sites or retailers, which it displays in shopping-related search results. It isn’t clear if Google would be willing to license that type of information to other firms, such as OpenAI, which wants ChatGPT to be a shopping search destination.”
All this is to highlight a key area to watch as LLM AI companies continue to scale their AI efforts at these early stages of the AI Tech Wave.
It also highlights how commercial and regulatory realities shape the interactions around AI data among industry participants, both large and small. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here.)