On Augmented Reality

Overlaying text on the real world.  Image courtesy of HowStuffWorks.com.

This year I switched groups at Microsoft, from the Zune marketplace group to Bing Mobile’s Augmented Reality division.  At Zune, I worked on a lot of engaging technology, including recommender systems based on directed acyclic graph processing (similar to Hadoop) and a continuous playlist generator called Smart DJ.  While I valued and enjoyed the work I did at Zune, I also felt ready to tackle something more nascent in media technology: active processing of the environment on mobile devices.  Thus began my foray into augmented reality (AR).

So aside from the sci-fi term of art, what is AR?  My first exposure to this technology came in the late 1980s or early 1990s.  I remember reading about a program for the Commodore Amiga that enabled a person to trigger drum sounds by striking virtual planes in the air.  A video camera captured the human figure, and the software defined target planes that corresponded to drum sounds.  Whenever the computer detected that one of the planes was crossed, it would trigger the matching drum sound.  This illustrates the canonical definition of AR: a modality that combines the real world and the virtual world, is interactive in real time, and maps the virtual world to the real world in 3-dimensional space.
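To make the drum example concrete, here is a minimal sketch of plane-crossing detection.  It is not how the Amiga program worked (those details are not known here); it assumes some tracker already supplies a 3D hand position for each video frame, and the names `DrumPlane` and `detect_hits` are made up for illustration.  The planes are treated as infinite, which a real system would want to bound.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class DrumPlane:
    name: str      # which drum sound this plane triggers
    point: Vec3    # any point on the plane
    normal: Vec3   # direction the plane faces

    def signed_distance(self, p: Vec3) -> float:
        # Positive on the side the normal points to, negative on the other.
        return sum((pi - qi) * ni for pi, qi, ni in zip(p, self.point, self.normal))


def detect_hits(planes, prev_pos, curr_pos):
    """Return names of planes crossed from the front between two frames."""
    hits = []
    for plane in planes:
        before = plane.signed_distance(prev_pos)
        after = plane.signed_distance(curr_pos)
        if before > 0 and after <= 0:  # moved from the front side through the plane
            hits.append(plane.name)
    return hits


# Hypothetical setup: a snare plane at height 1.0 and a cymbal at 1.5, both facing up.
planes = [
    DrumPlane("snare", (0.0, 1.0, 0.0), (0.0, 1.0, 0.0)),
    DrumPlane("cymbal", (0.5, 1.5, 0.0), (0.0, 1.0, 0.0)),
]

# The hand moves down from y=1.2 to y=0.9, passing through the snare plane only.
print(detect_hits(planes, (0.0, 1.2, 0.0), (0.0, 0.9, 0.0)))
```

Checking for a sign change between consecutive frames, rather than testing a single position, is what keeps a hand resting below the plane from retriggering the sound on every frame.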

Word Lens, an application for the iPhone, translates text automatically

Other examples of AR come from the movies.  Didn’t it seem that every scene that depicted robotic vision overlaid infographics onto a real-time video of what the robot saw?  What was fiction five or ten years ago is coming of age now, thanks to pervasive wireless Internet connectivity, powerful handheld computers (you know, those computers you can also use as a telephone), and advances in computer vision algorithms.  Now it’s possible to use a handheld computer to source and process video, register the video with the location and orientation of the camera using GPS, compass, and gyroscope sensors, and communicate with vast computing resources in the cloud over high-speed networks, all in real time.

This sounds cool, but does the technology have potential?  Some of us remember the heady euphoria of so-called “virtual reality” in the 1990s, with technologies like VRML, backpack computers, and video goggles promising to usher in a new age of human-computer interaction.  In the ’90s, computers were not fast enough or small enough for this concept to work in true real time, so these technologies ultimately failed.  Users of virtual reality goggles reported that the processing lag made them dizzy.

Yelp's monocle feature overlays business information on video captured from the real world.  The balloons appear near where the businesses are located.

Forrester Research believes there is potential for AR “to trigger disruption in the years to come and open up new opportunities.”  Some companies are already tinkering with AR layers to drive interest and commerce.  Take, for example, the recently released application Word Lens for the iPhone.

Word Lens allows you to point your phone’s camera at a sign and translate the words into another language.  Is this AR?  Yes: it blends the virtual world with the physical, is registered in 3D, and is a real-time experience.  Is it useful?  I haven’t traveled yet with Word Lens, but my guess is it could be useful in certain situations.  I think the jury is still out on startup time, though.  On the iPhone, you have to unlock the phone, launch the app, choose languages, and then point the camera at the sign.  That is a lot of steps to follow, though granted, almost all AR apps require about the same routine before you get what you want.

One of the first AR applications for the iPhone was Yelp’s.  Yelp is a crowdsourced review engine, and its mobile app is great for finding businesses or restaurants nearby.  The monocle feature took this a step further by turning on the camera and overlaying balloons onto the physical world, showing Yelp’s review scores near where the businesses exist.  When you rotate your phone, the balloons rotate with it.  This is great for making sure you’re going in the right direction to that highly reviewed falafel joint.
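For the curious, this kind of overlay can be approximated with GPS and compass data alone.  The sketch below is not Yelp’s implementation; it simply computes the compass bearing from the phone to a business and maps the difference between that bearing and the phone’s heading onto the screen.  The 60-degree field of view, the 320-pixel screen width, and the function names are all assumptions, and the linear angle-to-pixel mapping ignores lens projection.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def balloon_x(heading_deg, target_bearing_deg, fov_deg=60.0, screen_width=320):
    """Horizontal pixel position for a balloon, or None if it is outside the camera's view."""
    # Signed angle from the camera's forward direction to the target, in [-180, 180).
    offset = (target_bearing_deg - heading_deg + 180.0) % 360.0 - 180.0
    if abs(offset) > fov_deg / 2:
        return None
    # Linear approximation: the left edge is -fov/2, the right edge is +fov/2.
    return round((offset / fov_deg + 0.5) * screen_width)


# A business due north of the user, with the phone pointed 20 degrees west of north.
target = bearing_deg(47.60, -122.33, 47.61, -122.33)
print(balloon_x(340.0, target))
```

As the heading changes, `offset` changes with it, which is why the balloons appear to slide across the screen as you turn.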

At Bing Mobile, we introduced Augmented Reality features in the Bing for iPhone application (iBing for short).  As with Google Goggles, you can use your phone’s camera to recognize parts of the real world and do things in the virtual world with them.  We introduced a new view in version 2.0 called Bing Vision, and here is how it works.  You enter camera mode and point your phone’s camera at points of interest in the world.  We have a set of recognizers that process the video from your camera and try to extract information from the image.  For example, whenever we detect that there is text in the image, we flash an indicator telling you we can turn that text into a search query.  Or whenever we recognize a barcode, an indicator points to the barcode and the app automatically searches for the barcode in Bing’s index.  You can do this with a variety of objects.
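One way to picture a “set of recognizers” is as a list of independent functions that each inspect a frame and report what they find.  The sketch below is purely illustrative and is not Bing’s code: the `Detection` type, the recognizer functions, and the dictionary standing in for a camera frame are all hypothetical, and the stand-in recognizers read precomputed values instead of analyzing pixels.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Detection:
    kind: str          # e.g. "text" or "barcode"
    payload: str       # what was recognized
    auto_search: bool  # barcodes search immediately; text waits for the user


Recognizer = Callable[[dict], Optional[Detection]]


def text_recognizer(frame: dict) -> Optional[Detection]:
    # Stand-in: a real recognizer would analyze pixels, not read a dict key.
    text = frame.get("text")
    return Detection("text", text, auto_search=False) if text else None


def barcode_recognizer(frame: dict) -> Optional[Detection]:
    # Stand-in for an actual barcode decoder.
    code = frame.get("barcode")
    return Detection("barcode", code, auto_search=True) if code else None


def process_frame(frame: dict, recognizers: list[Recognizer]) -> list[Detection]:
    """Run every recognizer on the frame and collect whatever each one finds."""
    results = []
    for recognize in recognizers:
        detection = recognize(frame)
        if detection is not None:
            results.append(detection)
    return results


recognizers = [text_recognizer, barcode_recognizer]
for d in process_frame({"barcode": "036000291452"}, recognizers):
    print(d.kind, d.payload, "-> search now" if d.auto_search else "-> show indicator")
```

Keeping each recognizer behind the same small interface means a new one (cover art, say) can be added to the list without touching the others.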

Cover art is a great example.  Take a picture of a book, CD, poster, etc., and automatically conduct a web search:

The cover of this book is recognized by iBing

Search results for the cover art

Barcodes are another cool example.  Whenever iBing detects a barcode in the image, it displays an indicator that points to the barcode and automatically searches for the barcode in Bing’s index to find out what the product is, how much it costs, where you can find it, and so forth:

The indicator points to the barcode iBing recognizes.

Bing found the Moleskine I scanned and is ready to help me buy another one.

Scanning text is interesting.  Our application finds text in the image and then parses it to allow you to conduct web searches.  In the following example, pretend I’m reading my DSP book and want to find more information on a topic therein, the theorem of convolution.  First I point the camera at the text.

The "Aa" circle you see indicates iBing recognizes text in the image.

I tap the camera button to tell iBing I am interested in searching based on some of this text, so it parses the text using optical character recognition.

iBing is processing the text on the image I selected.

When the text is processed, I can touch words to add them to my search query.

I tapped the words "theorem" and "convolution," and now they are part of my search query.
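The tap-to-select step amounts to hit-testing a touch point against the bounding boxes that OCR produces for each word.  Here is a minimal sketch under that assumption; the `Word` type, the pixel coordinates, and the function names are hypothetical and not taken from the app.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float       # left edge of the word's bounding box, in pixels
    y: float       # top edge
    width: float
    height: float

    def contains(self, tx: float, ty: float) -> bool:
        return self.x <= tx <= self.x + self.width and self.y <= ty <= self.y + self.height


def word_at(words, tx, ty):
    """Return the first OCR word whose box contains the tap, or None."""
    for word in words:
        if word.contains(tx, ty):
            return word
    return None


def build_query(words, taps):
    """Join the tapped words, in tap order and without repeats, into a search query."""
    selected = []
    for tx, ty in taps:
        word = word_at(words, tx, ty)
        if word is not None and word.text not in selected:
            selected.append(word.text)
    return " ".join(selected)


# Hypothetical OCR output for one line of text.
words = [
    Word("theorem", 10, 40, 70, 18),
    Word("of", 85, 40, 18, 18),
    Word("convolution", 108, 40, 105, 18),
]
print(build_query(words, [(30, 50), (150, 45)]))
```

Taps that land between words are simply ignored, so a slightly off touch does not add anything unexpected to the query.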

Hitting the search button looks for more information on Bing.  I like how this works, and it’s usually faster than just typing the words myself.  This is the first version of our text feature, so it is a little limited today.  But we have big plans for it, so stay tuned.

More information about my selected topic, the theorem of convolution.

I had a chance to use iBing this holiday season to actually do something useful rather than just show off my group’s work.  My wife and mother-in-law were shopping with me one day for some audio-visual equipment.  The store was out of the product we wanted, but we happened to see a box on the floor with a model number similar to the one we wanted.  We didn’t see this product on display in the store at all.  So I pointed iBing’s barcode reader at the box, and it told us everything we needed to know: the model had similar features to the one we originally wanted, it cost slightly more, and we could purchase it at lots of nearby stores that day.  Win!

So does iBing 2.0 satisfy the definition of augmented reality?  It ties the physical world to the virtual world, and it is a real-time experience.  One might say the experience is registered in 3D, but that is a bit of a stretch.  It’s close, and as time goes on we want to get closer and closer to a real-world experience.  I’m glad to have made the professional switch into this new world.  There are a lot of fun problems to solve, and I like being on the cutting edge of something new and exciting.

Posted: December 29th, 2010 | Tags: augmented reality, bing

© Copyright 2012 | Signals | All Rights Reserved