Google Docs Rolls Out OCR Upload

Several months ago, Google began allowing people to mass-upload all their files to Google Docs (also check out this Firefox mass-upload plugin) for storage and/or conversion. (Help and Hints on how to do this)

Uploading and converting Microsoft Word, OpenOffice, or RTF documents to Google Docs is a great way to archive your document library and make it searchable. But if a document was a PDF or otherwise not actually text, it didn’t convert, and you were left with one large file that ate into your Google Docs file storage.

Now Google Docs is rolling out OCR (Optical Character Recognition) for uploaded PDF and image files.


From Google’s Help Files:

Files suitable for OCR can come from a number of sources:

  • Image or PDF files obtained using flatbed scanners
  • Images captured using digital cameras or mobile phones

Uploaded images or PDF files are used to extract text parts, which are converted into a Google document. 

For best extraction results, the image or PDF files need to meet certain requirements: 

  • Resolution: High-resolution files work best. As a rule of thumb, we recommend each line of text in the documents to be of at least 10 pixels height.
  • Orientation: Only documents with horizontal left-to-right text are recognized. If you have accidentally scanned or captured a document in a different orientation, please use an image manipulation program to rotate the images before uploading to Google docs.
  • Languages, fonts and character sets: Our OCR engine supports only Latin character sets at this stage, so for example Japanese text, Arabic text, or hand written text will not be detected. Common fonts such as Arial and Times New Roman will yield in best results.
  • Image quality: Sharp images with even lighting and clear contrasts will work best. Motion blur or bad camera focus will decrease the quality of the detected text.
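
To get a feel for the resolution rule of thumb, here is a rough Python sketch of the arithmetic. It assumes a line of text is about as tall as its font size in points (72 points to the inch), which is only an approximation, and the function names are made up for illustration; none of this comes from Google's help files.

```python
import math

# Rule of thumb quoted above: each line of text should be at least 10 pixels high.
MIN_LINE_HEIGHT_PX = 10
POINTS_PER_INCH = 72


def line_height_px(font_size_pt, dpi):
    """Approximate pixel height of a line of text at a given scan resolution.

    Assumption: line height is roughly equal to the font size in points.
    """
    return font_size_pt * dpi / POINTS_PER_INCH


def min_dpi(font_size_pt, min_px=MIN_LINE_HEIGHT_PX):
    """Smallest whole-number DPI at which a line reaches min_px pixels."""
    return math.ceil(min_px * POINTS_PER_INCH / font_size_pt)


def meets_rule(font_size_pt, dpi):
    """True if a scan at this DPI satisfies the 10-pixel rule of thumb."""
    return line_height_px(font_size_pt, dpi) >= MIN_LINE_HEIGHT_PX


if __name__ == "__main__":
    for size in (8, 10, 12):
        print(f"{size} pt text needs at least {min_dpi(size)} dpi")
```

Under these assumptions even a modest scanner setting clears the bar, so the rule matters mostly for phone or camera photos, where the whole page fills only part of the frame.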

By the way, this could be a great "poor man’s" ATS (applicant tracking system): a way to store, search, and retrieve resumes, if you are a recruiter.


