Saturday, July 21, 2007

Tesseract OCR, or basic PDF to text translation

Being a townhouse association president isn't as glamorous as it sounds. Sure, there's all that power from which comes an endless supply of women, and the opportunity to destroy people's lives.

Paperwork from my townhome's foundation was printed in 1974. In those olden days, I was in nursery school and nobody stored documents like these on a computer. My townhouse CC&Rs were sent to me via email, but as PDFs that are mere photocopies of the original documents rather than actual text.

Thanks to an article on PDF-to-text translation and some patience, I managed to get my townhouse documents into text format.

Note that it's not a great translation, but it's much better than typing everything in yourself.
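For the curious, the basic recipe is: turn each PDF page into a TIFF image, then run Tesseract on each image. What follows is not the Groklaw scripts, just a minimal sketch of that idea in Python. It assumes Ghostscript (`gs`) and a reasonably recent Tesseract are installed and on your PATH. The function names, the `page-001.tif` naming scheme, and the 300 dpi default are my own placeholders.

```python
import subprocess
from pathlib import Path


def pdf_to_tiff_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to its own TIFF."""
    # %03d is filled in by Ghostscript with the page number (001, 002, ...).
    pattern = str(Path(out_dir) / "page-%03d.tif")
    return [
        "gs", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",        # 8-bit grayscale TIFF output
        f"-r{dpi}",                 # rendering resolution in dpi
        f"-sOutputFile={pattern}",
        str(pdf_path),
    ]


def ocr_command(tiff_path):
    """Build a Tesseract command; `tesseract IMAGE BASE` writes BASE.txt."""
    tiff_path = Path(tiff_path)
    return ["tesseract", str(tiff_path), str(tiff_path.with_suffix(""))]


def pdf_to_text(pdf_path, out_dir):
    """Run both steps and return the recognized text of all pages."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_to_tiff_command(pdf_path, out_dir), check=True)

    pages = []
    for tiff in sorted(out_dir.glob("page-*.tif")):
        subprocess.run(ocr_command(tiff), check=True)
        text_file = tiff.with_suffix(".txt")
        pages.append(text_file.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(pages)
```

Calling `pdf_to_text("ccr.pdf", "ocr-pages")` (both names made up) would leave one `.tif` and one `.txt` per page in `ocr-pages` and hand back the combined text. Expect to proofread the result by hand.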

Groklaw - Tesseract OCR How-To, by Dr Stupid; Scripts by Fred Smith

1 comment:

Anonymous said...

You can do this without downloading and installing Tesseract, on "A Billion Billion - Free OCR for Everyone", although currently you will need to convert the PDF to TIFF yourself.