NISUS Archives

October 2010

NISUS@LISTSERV.DARTMOUTH.EDU

From: Philip Spaelti
Date: Tue, 19 Oct 2010 19:20:22 +0900

On 18. Oct 2010, at 19:29 , THDW wrote:

> What is the easiest way to scan and ocr a novel with the Mac ?

My first question is why do you need to OCR them? You wrote them (with Nisus?). Don't you have the files?

The easiest way to scan is to use a machine with an automatic feeder. I have a copier here that can scan pages and store them on a USB stick. With the automatic feeder, 250 pages would take half an hour or less. This only works if you have an extra copy that you can sacrifice: cut off the spine and feed the loose pages through.

If you can't or don't want to sacrifice the book, you can scan double pages, turning the pages by hand. With a cheap tabletop scanner this will take about a minute per scan, so about 2 (boring) hours, if you keep at it, for 125 double pages.
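
For anyone who wants to plug in their own numbers, here is a rough sketch of that arithmetic. The function name and parameters are made up for illustration, and the figures are just the estimates above (one minute per flatbed scan, about half an hour for the feeder), not measurements.

```python
import math

def scan_minutes(pages, pages_per_scan, seconds_per_scan):
    """Estimate total scanning time in minutes (hypothetical helper)."""
    # A double-page scan covers two pages, so round up for odd page counts.
    scans = math.ceil(pages / pages_per_scan)
    return scans * seconds_per_scan / 60

# Flatbed: 250 pages scanned as double pages, about one minute per scan.
flatbed_minutes = scan_minutes(250, pages_per_scan=2, seconds_per_scan=60)

# Feeder: taken directly from the "half an hour (or less)" estimate.
feeder_minutes = 30

print(f"flatbed: {flatbed_minutes:.0f} min, feeder: {feeder_minutes} min")
print(f"the feeder is roughly {flatbed_minutes / feeder_minutes:.1f}x faster")
```

The flatbed figure comes out at 125 minutes, which is where the "about 2 hours" comes from.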

OCR is pretty fast (faster than scanning by hand!), but results are variable. I have got near-perfect results on scans that were short, with nice clear type, and 100% in English. But for most other things you will really need to spend many hours correcting the results.

I wonder if you are expecting OCR to recreate the formatting of the published book. I generally don't bother with having the OCR software reproduce the formatting. With the kind of stuff that I OCR I find that the results are simply unusable, despite exaggerated claims by OCR software manufacturers.

In my own estimation, when projects such as Google Books scan and OCR books, they don't actually reproduce the book as an electronic file. Rather, they seem to keep the scanned images for display and use the (rough) OCR text only for searching. That certainly speeds up the process, since you don't need to proofread the shoddy OCR results.
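
A minimal sketch of that "images for display, rough OCR for search" idea, assuming you already have the uncorrected OCR text for each page in hand. This is only my guess at the general approach, not a description of how Google Books actually works; the function names and sample text are invented for illustration.

```python
import re
from collections import defaultdict

def build_index(ocr_pages):
    """Map each word in the rough OCR text to the pages it appears on."""
    index = defaultdict(set)
    for page_number, text in ocr_pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_number)
    return index

def search(index, query):
    """Return the sorted page numbers that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    pages = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(pages)

# Invented sample: uncorrected OCR text keyed by page number.
# Page 1 contains a typical OCR error ("storrny" for "stormy").
ocr_pages = {
    1: "It was a dark and storrny night.",
    2: "The night was quiet; the scanner hummed.",
}
index = build_index(ocr_pages)

hits_night = search(index, "night")        # pages to show as scanned images
hits_dark_night = search(index, "dark night")
hits_stormy = search(index, "stormy")      # missed because of the OCR error
```

A search hit only has to point at the right page image, so most OCR errors do no harm. The cost is that a misread word, like "storrny" here, simply cannot be found.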


Philip Spaelti