Google has announced that its web crawlers can now index text from scanned images. Google’s interest in Optical Character Recognition (OCR) has been evident for the past few years. Now the company is one step closer to making all the information in the world searchable.
The Google Blog aptly summarizes the new technology:
In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document– so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful.
Earlier this year, Google made public the details of a patent it had filed for reading text from images and video. This implied that text recognized in Google Maps and Street View imagery could eventually show up in search results. The applications of this technology are immense, but the next progression is what will make all the difference.
Technologies that index speech in videos already exist. With the added capability to index text in videos as well, product-oriented search will become all the more relevant. This might be great news for the business model driving search engines, but it raises several privacy issues as well.
Originally posted on November 5, 2008 @ 5:02 pm