NISUS Archives

October 2010

NISUS@LISTSERV.DARTMOUTH.EDU

Date: Wed, 20 Oct 2010 03:40:06 +0900
On Oct 19, 2010, at 7:38 PM, THDW wrote:

> so I will probably be doing everything by hand.

Although I have never tried to process a scanned image with OCR software myself, I have some experience in proofreading and correcting document files that someone else created from such images -- not one file, in fact, but 280 (?!) separate files, oh well. So, from my own limited but painful experience, I'd like to recommend the following...

1. If you can, exclude the header and footer when scanning your books. It is rather tedious to remove that text from the OCRed file(s), even if you are familiar with regular expressions.

2. The first thing you should do with the OCRed file(s) is to apply a colour, or something else very visible, to all numerals ("Find All AnyDigit" in PowerFind). It seems that some OCR software often recognizes "I" and "l" as "1" (one) and "O" (uppercase o) as "0" (zero), and vice versa.

3. OCR software tends to get the case wrong for isolated characters, such as the "p" in "p. 135", which is often recognized as "P. 135".
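
For anyone who wants to run the checks in points 2 and 3 outside Nisus, they could be sketched roughly as follows in Python. This is only a sketch under my own assumptions: the function name flag_suspects and the three patterns are hypothetical, not something PowerFind or any OCR package provides, and the "P." check assumes that page references in your text are normally written with a lowercase "p.". It only lists suspect spots for a human to review; it does not correct anything.

```python
import re

# Point 2: every run of digits is worth a look, since "I"/"l" may have
# become "1" and "O" may have become "0".
DIGIT_RUN = re.compile(r"\d+")

# Point 2, the "vice versa" case: a token that mixes letters and digits,
# such as "b1ue" or "2O10", is a likely misreading in either direction.
MIXED_TOKEN = re.compile(r"\b(?=\w*\d)(?=\w*[A-Za-z])\w+\b")

# Point 3: a capital "P." directly before a number, as in "P. 135".
# Assumption: the source text writes page references as "p. 135".
CAPITAL_PAGE_REF = re.compile(r"\bP\.\s*\d+")


def flag_suspects(text):
    """Return (kind, start, end, matched_text) tuples, ordered by position."""
    hits = []
    for kind, pattern in (
        ("digit", DIGIT_RUN),
        ("mixed", MIXED_TOKEN),
        ("page-ref", CAPITAL_PAGE_REF),
    ):
        for m in pattern.finditer(text):
            hits.append((kind, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])


if __name__ == "__main__":
    sample = "See P. 135 and the b1ue book from 2O10."
    for kind, start, end, found in flag_suspects(sample):
        print(f"{kind:8} {start:3}-{end:<3} {found!r}")
```

The same spot can be reported more than once (for example, "135" as a digit run and "P. 135" as a page reference). I left it that way on purpose, since over-reporting seems safer than missing something when proofreading.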


Kino
--

This is not directly related to your problem, but a while ago I was asked to check the French quotations in the third proof of a Japanese book that was to be reissued, and I was astonished to find that quite a few instances of "he" in Hiragana ("be" and "pe" as well) had been treated as "he" in Katakana, and vice versa. Indeed, they look very similar in many fonts.
