LISTSERV - LLTI Archives - LISTSERV.DARTMOUTH.EDU

--- Forwarded Message from "Richard Kunst" <[log in to unmask]> ---

>Reply-To: <[log in to unmask]>
>From: "Richard Kunst" <[log in to unmask]>
>To: "'Language Learning and Technology International Information Forum'"   
<[log in to unmask]>
>References:  <[log in to unmask]>
>In-Reply-To:  <[log in to unmask]>
>Subject: RE: #8932 Feedback on IRIS Asian OCR sw?
>Date: Sat, 13 Sep 2008 13:25:25 -0400
>Organization: Humanities Computing Lab
>Thread-Index: AckVo7GVSdqknohQRf6wp1WJKootogAG9wlQ

On Thu, 11 Sep 2008 12:16:28 -0400 Jose Rodriguez <[log in to unmask]> wrote

> Does anyone have any feedback on ReadIRIS Asian edition of their OCR
> software?

Dear Jose and list,

A few years ago I OCRed a few thousand pages of Chinese text using the Asian
Add-on to ReadIRIS ver. 10. It was excellent, as is ReadIRIS in general, but as
with other OCR software, it had its frustrating quirks. It was sometimes
inferior to, and sometimes better than, the core ReadIRIS. (I assume the engine
was developed independently.)

There were some characters which were so regularly misrecognized that I
gradually built up a list of "the usual suspects" to watch for during
post-editing. I have appended it below. And if that doesn't pass through the
listserv, it is also on the following web page:

http://www.humancomp.org/misc/problem_characters_in_readiris_chinese_ocr.html 

Almost all of the text I OCRed was horizontal L-to-R, but as I recall, it worked
OK for vertical text too. I tried out the Japanese and Korean OCR as well with
good success, but didn't do more than a few small tests.

Best wishes,
Rick Kunst

_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
The Humanities Computing Laboratory
A Nonprofit Education and Research Corporation
109 Lariat Lane, Suite B
Chapel Hill, NC 27517 USA
Tel. +1 919 656-5915
E-mail: [log in to unmask]
http://www.humancomp.org
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/


************************************************

(Some characters used in editing ReadIRIS OCR output.)
$E7o$BD$E8aTM$E7oAE$E4PIA$E7(R)$BB$E4$BD$A4$E8aC$E5a$BA$E5ea$E6aa$E5O*$E5+/-+/-$
E7$BBU$E8BA$E7u=>$E5ea$E8Aa$E5O'$E6oB$E3AC$E5oa$E6u'$E5ea$E5cu$E6uTM$E6u'$E5ae$E
5#'$E5uu$E7i$B0$E5<>o$E5$A4(c)$E6ao

Correct     Incorrect
$E8$A6A (replace $E8Ac)
$E5e$AC (replace $E6oi)
$E5e=> (replace $E6oB)
$E6o$B0 (replace $E6o*)
$E5Nnil (replace JL)
$E4$BAU (replace $E4PIA)
$E8aC (replace $E6o*)
$E8AA (replace $E8AO)
$E5cu (replace $E5cdeg.)
$E5<>o (replace $E5AEa)
$E9oAE (replace $E9o$B4)
$E6o$A0 (replace $E5OE)
$E5<>e (replace $E4$BAe)
$E5ae (replace $E5aa)
$E7i$B0 (replace $E5ou)
$E5<>o (replace $E5AEa)
$E8$A6A (replace $E5$A6*)
$E7oN (replace $E5ua)

Incorrect
$E8aTM (replace *selectively* with $E7o$BD )
$E5aa (replace *almost* all with $E5a$BA )
$E7$BBN (replace *selectively* with $E7$BBU )
$E5a$BA (replace *selectively* with $E5+/-+/- )
$E4$BD$A4 (replace $E4$BDe$E5Aa$E5ATM$E4aeE$E7oPI$E7i$A6$E5Ao$E4oc etc.)
$E6aa$E7*u (replace $E6aa$E6deg.i...)
$E9pia$E9pie (replace $E9pia$E8aee,$E9pia$E8oi,$E9pi$A4$E8oi,$E9pia$E5o+/-)




***********************************************
LLTI is a service of IALLT, the International Association for
Language Learning Technology (http://iallt.org/), and The Consortium for
Language Teaching and Learning (http://www.languageconsortium.org/).
Join IALLT at http://iallt.org.
Otmar Foelsche, LLTI-Editor ([log in to unmask])
***********************************************