Google isn’t ready to turn search into a conversation

The future of search is a conversation — at least, according to Google.

It’s a pitch the company has been making for years, and it was the centerpiece of last week’s I/O developer conference. There, the company demoed two “groundbreaking” AI systems, LaMDA and MUM, that it hopes, one day, to integrate into all its products. To show off LaMDA’s potential, Google had it speak as the dwarf planet Pluto, answering questions about the celestial body’s environment and the New Horizons probe’s flyby.

As this tech is adopted, users will be able to “talk to Google”: using natural language to retrieve information from the web or their personal archives of messages, calendar appointments, photos, and more.

This is more than just marketing for Google. The company has evidently been contemplating what would be a major shift to its core product for years. A recent research paper from a quartet of Google engineers titled “Rethinking Search” asks exactly this: is it time to replace “classical” search engines, which provide information by ranking webpages, with AI language models that deliver these answers directly instead?

There are two questions to ask here. First, can it be done? After years of slow but definite progress, are computers really ready to understand all the nuances of human speech? And second, should it be done? What happens to Google if the company leaves classical search behind? Appropriately enough, neither question has a simple answer.

Can it be done?
There’s no doubt that Google has been pushing a vision of speech-driven search for a long time now. It debuted Google Voice Search in 2011, then upgraded it to Google Now in 2012; launched Assistant in 2016; and in numerous I/Os since, has foregrounded speech-driven, ambient computing, often with demos of seamless home life orchestrated by Google.

Despite clear advances, I’d argue that the actual utility of this technology falls far short of the demos. Take the 2016 introduction of Google Home, for example, where Google promised that the device would soon let users “control things beyond the home, like booking a car, ordering dinner, or sending flowers to mom, and much, much more.” Some of these things are now technically feasible, but I don’t think they’re common: speech has not proven to be the flexible and faultless interface of our dreams.

Everyone will have different experiences, of course, but I find that I only use my voice for very limited tasks. I dictate emails on my computer, set timers on my phone, and play music on my smart speaker. None of these constitute a conversation. They are simple commands, and experience has taught me that if I try anything more complicated, words will fail. Sometimes this is due to not being heard correctly (Siri is atrocious on that score), but often it just makes more sense to tap or type my query into a screen.

Watching this year’s I/O demos, I was reminded of the hype surrounding self-driving cars, a technology that has so far failed to deliver on its biggest claims (remember Elon Musk promising that a self-driving car would take a cross-country trip in 2018? It hasn’t happened yet). There are striking parallels between the fields of autonomous driving and speech tech. Both have seen major improvements in recent years thanks to the arrival of new machine learning techniques coupled with abundant data and cheap computation. But both also struggle with the complexity of the real world.

In the case of self-driving cars, we’ve created vehicles that don’t perform reliably outside of controlled settings. In good weather, with clear road markings, and on wide streets, self-driving cars work well. But steer them into the real world, with its missing signs, sleet and snow, unpredictable drivers, and they are clearly far from fully autonomous.

It’s not hard to see the similarity with speech. The technology can handle simple, direct commands that require the recognition of only a small number of verbs and nouns (think “play music,” “check the weather” and so on) as well as a few basic follow-ups, but throw these systems into the deep waters of conversation and they flounder. As Google’s CEO Sundar Pichai commented at I/O last week: “Language is endlessly complex. We use it to tell stories, crack jokes, and share ideas. […] The richness and flexibility of language make it one of humanity’s greatest tools and one of computer science’s greatest challenges.”

However, there are reasons to think things are different now (for speech anyway). As Google noted at I/O, it’s had tremendous success with a new machine learning architecture known as Transformers, a model that now underpins the world’s most powerful natural language processing (NLP) systems, including OpenAI’s GPT-3 and Google’s BERT. (If you’re looking for an accessible explanation of the underlying tech and why it’s so good at parsing language, I highly recommend this blog post from Google engineer Dale Markowitz.)

The arrival of Transformers has created a truly incredible, genuinely awe-inspiring flowering of AI language capabilities. As has been demonstrated with GPT-3, AI can now generate a seemingly endless variety of text, from poetry to plays, creative fiction to code, and much more, always with surprising ingenuity and verve. Transformer-based models also deliver state-of-the-art results in various speech and linguistic tests and, what’s better, they scale incredibly well. That means if you pump in more computational power, you get reliable improvements. The supremacy of this paradigm is sometimes known in AI as the “bitter lesson” and is very good news for companies like Google. After all, they’ve got plenty of compute, and that means there’s lots of road ahead to improve these systems.

Google channeled this excitement at I/O. During a demo of LaMDA, which has been trained specifically on conversational dialogue, the AI model pretended first to be Pluto, then a paper airplane, answering questions with imagination, fluency, and (mostly) factual accuracy. “Have you ever had any visitors?” a user asked LaMDA-as-Pluto. The AI responded: “Yes I have had some. The most notable was New Horizons, the spacecraft that visited me.”

A demo of MUM, a multi-modal model that understands not only text but also image and video, had a similar focus on conversation. When the model was asked: “I’ve hiked Mt. Adams and now want to hike Mt. Fuji next fall, what should I do differently to prepare?” it was smart enough to know that the questioner is not only looking to compare mountains, but that “preparation” means finding weather-appropriate gear and relevant terrain training. If this sort of subtlety can transfer into a commercial product — and that’s obviously a huge, skyscraper-sized if — then it would be a genuine step forward for speech computing.

Should it be done?
That, though, brings us to the next big question: even if Google can turn search into a conversation, should it? I won’t pretend to have a definitive answer to this, but it’s not hard to see big problems ahead if Google goes down this route.

First are the technical problems. The biggest is that it’s impossible for Google (or any company) to reliably validate the answers produced by the sort of language AI the company is currently demoing. There’s no way of knowing exactly what these sorts of models have learned or what the source is for any answer they provide. Their training data usually consists of sizable chunks of the internet and, as you’d expect, this includes both reliable data and garbage misinformation. Any response they give could be pulled from anywhere online. This can also lead them to produce output that reflects the sexist, racist, and biased notions embedded in parts of their training data. And these are criticisms that Google itself has seemingly been unwilling to reckon with.

Similarly, although these systems have broad capabilities, and are able to speak on a wide array of topics, their knowledge is ultimately shallow. As Google’s researchers put it in their paper “Rethinking Search,” these systems learn assertions like “the sky is blue,” but not associations or causal relationships. That means that they can easily produce bad information based on their own misunderstanding of how the world works.

Kevin Lacker, a programmer and former Google search quality engineer, illustrated these sorts of errors in GPT-3 in this informative blog post, noting how you can stump the program with common sense questions like “Which is heavier, a toaster or a pencil?” (GPT-3 says: “A pencil”) and “How many eyes does my foot have?” (A: “Your foot has two eyes”).

To quote Google’s engineers again from “Rethinking Search”: these systems “do not have a true understanding of the world, they are prone to hallucinating, and crucially they are incapable of justifying their utterances by referring to supporting documents in the corpus they were trained over.”

These issues are amplified by the sort of interface Google is envisioning. Although it’s possible to overcome difficulties with things like sourcing (you can train a model to provide citations, for example, noting the source of each fact it gives), Google imagines every answer being delivered ex cathedra, as if spoken by Google itself. This potentially creates a burden of trust that doesn’t exist with current search engines, where it’s up to the user to assess the credibility of each source and the context of the information they’re shown.

The pitfalls of removing this context are obvious when we look at Google’s “featured snippets” and “knowledge panels,” cards that Google shows at the top of the Google.com search results page in response to specific queries. These panels highlight answers as if they’re authoritative, but the problem is they’re often not, an issue that former search engine blogger (and now Google employee) Danny Sullivan dubbed the “one true answer” problem.

These snippets have made headlines when users discover particularly egregious errors. One example from 2017 involved asking Google “Is Obama planning martial law?” and receiving the answer (cited from a conspiracy news site) that, yes, of course he is (if he was, it didn’t happen).

In the demos Google showed at I/O this year of LaMDA and MUM, it seems the company is still leaning toward this “one true answer” format. You ask and the machine answers. In the MUM demo, Google noted that users will also be “given pointers to go deeper on topics,” but it’s clear that the interface the company dreams of is a direct back and forth with Google itself.

This will work for some queries, certainly: simple demands that are the search equivalent of asking Siri to set a timer on my phone (e.g. asking when Madonna was born, who sang “Lucky Star,” and so on). But for complex problems, like those Google demoed at I/O with MUM, I think this format will fall short. Tasks like planning holidays, researching medical problems, shopping for big-ticket items, looking for DIY advice, or digging into a favorite hobby all require personal judgement rather than computer summary.

The question, then, is will Google be able to resist the lure of offering one true answer? Tech watchers have noted for a while that the company’s search products have become more Google-centric over time. The company increasingly buries results under ads that are both external (pointing to third-party companies) and internal (directing users to Google services). I think the “talk to Google” paradigm fits this trend. The underlying motivation is the same: it’s about removing intermediaries and serving users directly, presumably because Google believes it’s best positioned to do so.

In a way, this is the fulfillment of Google’s corporate mission “to organise the world’s information and make it universally accessible and useful.” But this approach could also undermine what makes the company’s product such a success in the first place. Google isn’t useful because it tells you what you need to know; it’s useful because it helps you find this information for yourself. Google is the index, not the encyclopedia, and it shouldn’t sacrifice search for results.