Search Beyond The Keyword

Teaching machines to see – the future of visual search
by James Murray, August 10, 2016

As part of IAB search week, James Murray, UK Search Advertising Lead at Microsoft, gives us an insight into the future of visual search.

Search beyond the keyword

For the last twenty years, search has been almost exclusively a keyword-driven business. We’ve understood what people want by the words that they use.

In so doing, search has catalogued and indexed the world’s textual information. But clearly there is so much more to experience than text. More recently, with the adoption of voice search, we’ve started to move beyond keywords.

Consider the search term: ‘Do I need an umbrella tomorrow?’ If we were to process that question on a purely keyword basis, then the result would drive consumers to a retail product page to buy a new umbrella.

But of course, the query has nothing to do with wet-weather apparel. Only recently have we been able to discern that ‘Do I need an umbrella tomorrow?’ is actually code for the real intent: ‘Is it going to rain tomorrow?’, which requires a very different response from a request for an umbrella.
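To make the contrast concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than how Bing or any real search engine works: the tiny `KEYWORD_INDEX`, the hand-written `INTENT_RULES` and both function names are assumptions made up for this example.

```python
# Hypothetical toy index: keyword -> result page. Not a real search API.
KEYWORD_INDEX = {
    "umbrella": "retail page: buy umbrellas",
    "rain": "weather forecast",
}

# Hypothetical hand-written intent rules, for illustration only.
INTENT_RULES = [
    ("do i need an umbrella", "weather forecast"),
]


def keyword_search(query):
    """Return the result for the first query word found in the index."""
    for word in query.lower().strip("?").split():
        if word in KEYWORD_INDEX:
            return KEYWORD_INDEX[word]
    return None


def intent_search(query):
    """Check whole-phrase intent rules first, then fall back to keywords."""
    text = query.lower()
    for phrase, result in INTENT_RULES:
        if phrase in text:
            return result
    return keyword_search(query)


query = "Do I need an umbrella tomorrow?"
print(keyword_search(query))  # matches on the word 'umbrella'
print(intent_search(query))   # matches on the whole question
```

The keyword version latches onto ‘umbrella’ and sends the user shopping, while the intent version treats the whole question as a request for a forecast. A real system would learn these mappings rather than list them by hand.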

Our understanding of natural language queries is pushing search way beyond index matching of keywords, and voice search is the catalyst for this development. As search algorithms get smarter though, our demand is moving beyond the semantic.

We want machines to become more human, to process information as we do and make sense of every stimulus. Not just text and speech, but images, smells and textures.

The next evolution of search is to be able to process, categorise and index these sensations as we currently process text. And the first step on that journey is to teach a machine to see. 

What makes a chair a chair?

As a masters graduate of philosophy I have spent many long hours discussing the metaphysical merits of what makes a chair a chair. However, philosophy aside, in terms of visual search this is a very real problem that we have to grapple with.

How do you give a search engine the parameters to decide what is a chair and what isn’t? How do we distil the essence of ‘chairness’ into something an algorithm can recognise? This sounds crazy until you try and describe what a chair is without simply pointing at one.

A quick Bing search will show you that the dictionary definition of chair is ‘a separate seat for one person, typically with a back and four legs’. But clearly there are so many varieties of chair that this is an inadequate description for a machine to go on.

In my kitchen for example, I have chairs with four legs and a back, just like the dictionary definition, but my office chair has one central leg and then five spider legs that come off it. Meanwhile, my colleague uses a kneeling posture chair that has neither legs nor a back.

And yet, as humans we can divine fairly easily that these are all chairs, implements for sitting on, without getting them confused with other sitting accessories like sofas. The complexity of defining rules for what constitutes a simple object like a chair is mind boggling.
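A short Python sketch shows how quickly a hand-written rule breaks down. It assumes a deliberately literal reading of the dictionary definition (one seat, a back, exactly four legs). The `Seat` type, the `is_chair_by_dictionary` function and the leg counts assigned to each example are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Seat:
    name: str
    seats: int      # how many people it is meant for
    legs: int       # assumed leg count for this example
    has_back: bool


def is_chair_by_dictionary(item):
    """Literal reading of the definition: one seat, a back, four legs."""
    return item.seats == 1 and item.has_back and item.legs == 4


examples = [
    Seat("kitchen chair", seats=1, legs=4, has_back=True),
    Seat("office chair", seats=1, legs=5, has_back=True),
    Seat("kneeling chair", seats=1, legs=0, has_back=False),
    Seat("sofa", seats=3, legs=4, has_back=True),
]

# Map each item's name to the rule's verdict.
results = {item.name: is_chair_by_dictionary(item) for item in examples}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

The rule correctly rejects the sofa, but it also rejects the office chair and the kneeling chair, both of which any person would call a chair. Loosening the rule to admit them starts letting in stools and benches, which is why hand-written definitions do not scale to every object in the world.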

Now imagine having to construct a set of rules and definitions not just for chairs, or indeed just for chairs and tables and other furniture, but for every object in the world. That is the task of the visual search engine. And, believe me, it is no small thing.

The breakthrough moment—Project Adam

There is a team based at Microsoft HQ in Redmond, Washington, who have attempted to do the impossible: to categorise every known object in the world.

They’ve used the power of Bing in their quest, cataloguing the billions of images that we index and then using our machine-learning algorithms to learn millions of visual patterns and connections.

What they came up with was Project Adam, a machine which could recognise the breed of any given dog just by taking a photo and then running that image against our vast catalogue of known data to make a correct match.

The technology and pattern recognition in Adam is phenomenal, and in 2014 we were able to correctly identify any one of thousands of pure and crossbreeds simply from taking a photo of the dog. We’ve now taken that technology to recognise complex objects that relate to each other.

Complex visual recognition—CaptionBot

Recognising a single dog is impressive, but for visual search to really work it needs to be able to understand the connection and juxtaposition of multiple objects together. CaptionBot is Microsoft’s first attempt to describe what is happening in an image.

Let’s take a fairly complex image like a photo from my wedding day.