Project description: Difference between revisions
From Kallimachos
<br clear=all>
===''anyOCR'': a self-learning OCR system===
The DFKI established the term ''anyOCR'' for an adaptable OCR method which, in contrast to established OCR systems (i.e. systems based on atomic character segments without more coarse-grained segments like lines or paragraphs), can adapt to different requirements and to the specific problems of OCR for historical documents. Segmentation-free OCR methods based on sequence learning could already be used for handwritten, diversely printed and historical documents. They recognize complete lines of text at once and achieve a higher recognition rate than traditional segmentation-based OCR methods. However, these methods need a large amount of manually transcribed training material to achieve satisfactory results. Generating this so-called ''ground truth'' is time-consuming and expensive. Additionally, generating the required ground truth synthetically is not feasible in the domain of historical documents, as no representative texts are available.
<br clear=all>
[[File:anyOCRtPipeline.png|600px|center|OCRoRACT-anyOCR Training Pipeline|link=|alt=training model of the anyOCR workflow]]
<br clear=all>
To deal with the problem of missing ground truth data for sequence learning, the DFKI has developed the framework ''OCRoRACT'', which is based on the ''anyOCR'' method. Here, a conventional character-based OCR method is deployed to train an initial OCR model using individually recognized symbols. The resulting lines of text, which (in contrast to an actual ''ground truth'') may contain errors, are then used to train the sequence learning model in place of manually generated ground truth. By using contextual information, the system is able to learn how to correct the errors in this pseudo-ground truth. An ''OCRoRACT'' system trained in this fashion for historical documents has proven able to deliver suitable recognition rates despite the lack of the required dictionaries.
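
The bootstrapping idea described above can be illustrated with a small toy sketch. It is not the ''OCRoRACT'' implementation: the confusion pair, the deterministic error pattern and all function names are assumptions made for illustration, and a simple majority vote over word shapes stands in for the sequence learning model. It only shows how an error-prone pseudo-ground truth can still be used for training, because occasional misreadings are outvoted by the surrounding context.

```python
from collections import Counter, defaultdict

# Hypothetical confusion pair for the toy character recognizer: it sometimes
# mixes up "u" and "n". Real confusions depend on the typeface.
CONFUSIONS = {"u": "n", "n": "u"}


def noisy_char_ocr(lines, period=5):
    """Stand-in for the character-based OCR step (an assumption, not OCRoRACT).

    Deterministically misreads every `period`-th confusable glyph, so the
    returned transcriptions form an error-prone pseudo-ground truth.
    """
    seen = 0
    out = []
    for line in lines:
        chars = []
        for ch in line:
            if ch in CONFUSIONS:
                seen += 1
                if seen % period == 0:
                    ch = CONFUSIONS[ch]
            chars.append(ch)
        out.append("".join(chars))
    return out


def shape_key(word):
    """Collapse confusable glyphs into one class; the other letters act as context."""
    return "".join("?" if ch in CONFUSIONS else ch for ch in word)


def train_on_pseudo_ground_truth(pseudo_lines):
    """For each word shape, remember the reading that occurs most often."""
    votes = defaultdict(Counter)
    for line in pseudo_lines:
        for word in line.split():
            votes[shape_key(word)][word] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


def recognize(line, model):
    """Re-read a line, replacing each word by the learned majority reading."""
    return " ".join(model.get(shape_key(w), w) for w in line.split())


def char_errors(predicted, truth):
    """Count character positions where prediction and truth differ."""
    return sum(a != b for p, t in zip(predicted, truth) for a, b in zip(p, t))


# The true text is used only to simulate the page content and to score the
# result; it is never passed to the training step.
truth = ["und nun neu"] * 10
pseudo = noisy_char_ocr(truth)
model = train_on_pseudo_ground_truth(pseudo)
corrected = [recognize(line, model) for line in pseudo]

errors_before = char_errors(pseudo, truth)
errors_after = char_errors(corrected, truth)
print(errors_before, errors_after)
```

In this toy run every misreading is a minority among the occurrences of its word, so the vote removes all of them. A real system faces far less regular errors, and words that differ only in confusable glyphs would collide under this simple key, which is one reason a context-aware sequence model is used instead.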
<br clear=all>
<br clear=all>