The BBC has published an interesting article about an initiative from the Internet Archive. A Flickr page, built on technology developed by Kalev H. Leetaru of Georgetown University, now offers a pool of millions of images (2.6 million so far) extracted from the Internet Archive's scanned books.
Here’s an excerpt from the BBC article explaining how it works:
The Internet Archive had used an optical character recognition (OCR) program to analyse each of its 600 million scanned pages in order to convert the image of each word into searchable text. As part of the process, the software recognised which parts of a page were pictures in order to discard them.
Mr Leetaru’s code used this information to go back to the original scans, extract the regions the OCR program had ignored, and then save each one as a separate file in the Jpeg picture format.
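To make the idea in that excerpt concrete, here is a minimal Python sketch of the cropping step. It is not Mr Leetaru's actual code, which the article does not show. It assumes the OCR output can be reduced to a list of bounding boxes for the regions marked as pictures, and it models a scanned page as a plain grid of pixel values. The names Box, crop and extract_pictures are hypothetical, chosen only for this illustration.

```python
from typing import List, NamedTuple

# A page is modelled as rows of grayscale pixel values (an assumption made
# for illustration; real scans would be image files).
Page = List[List[int]]


class Box(NamedTuple):
    """Hypothetical bounding box of a region the OCR step marked as a picture."""
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop(page: Page, box: Box) -> Page:
    """Return the pixels of `page` that fall inside `box`."""
    return [row[box.left:box.right] for row in page[box.top:box.bottom]]


def extract_pictures(page: Page, picture_boxes: List[Box]) -> List[Page]:
    """Cut out every region the OCR step skipped, one sub-image per box."""
    return [crop(page, box) for box in picture_boxes]


# A tiny 4x6 "scan" containing two picture regions (the 9s and the 7s).
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 7, 7],
    [0, 0, 0, 0, 7, 7],
]
boxes = [Box(1, 1, 3, 3), Box(4, 2, 6, 4)]
pictures = extract_pictures(page, boxes)
```

In a real pipeline, each cropped region would then be written out as its own JPEG file with an imaging library. That step is left out here to keep the sketch dependency-free.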
And here is the Flickr page: https://www.flickr.com/photos/internetarchivebookimages/