Project description: Difference between revisions
From Kallimachos
<br clear=all>
===''anyOCR'': a self-learning OCR system===
The DFKI established the term ''anyOCR'' for an adaptable OCR method which, in contrast to established OCR systems (i.e. systems based on atomic character segments without more coarse-grained segments such as lines or paragraphs), can adapt to different requirements and to the specific problems of OCR for historical documents. Existing segmentation-free OCR methods based on sequence learning could already be applied to handwritten, diversely printed and historical documents; they recognize complete lines of text at once and achieve a higher recognition rate than traditional segmentation-based OCR methods. However, achieving satisfactory results with these methods requires a large amount of manually transcribed training material. Generating this so-called ground truth is time-consuming and expensive. Additionally, synthetic generation of ground truth is not feasible in the domain of historical documents, as no representative texts are available.
<br clear=all>
[[File:anyOCRtPipeline.png|600px|center|OCRoRACT-anyOCR Training Pipeline|link=|alt=training model of the anyOCR workflow]]
<br clear=all>
To deal with the problem of missing ground truth data for sequence learning, the DFKI has developed the framework ''OCRoRACT'', based on the ''anyOCR'' method. Here, a conventional character-based OCR method is used to train an initial OCR model on individually recognized symbols. The resulting lines of text, which (in contrast to an actual ''ground truth'') may contain errors, are then used to train the sequence learning model in place of manually generated ground truth. By using contextual information, the system is able to learn how to correct the errors in this pseudo-ground truth. An OCRoRACT system trained in this fashion for historical documents has proven able to deliver suitable recognition rates despite the lack of the required dictionaries.
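The flow described above (character-based OCR produces error-prone line transcriptions, which then serve as training data for a second model) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: lines are modelled as sequences of hypothetical glyph ids, the character OCR is simulated, and the learner is a per-glyph majority vote rather than the sequence learning model that OCRoRACT actually trains. It only shows that a model trained on noisy pseudo-ground truth can outvote sporadic labelling errors.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Line = Sequence[int]                 # hypothetical: a text line as a sequence of glyph ids
Recognizer = Callable[[Line], str]   # anything that turns a line into text


def make_pseudo_ground_truth(lines: List[Line], char_ocr: Recognizer) -> List[Tuple[Line, str]]:
    """Label every line with the (possibly erroneous) output of the character-based OCR."""
    return [(line, char_ocr(line)) for line in lines]


def train_majority_model(pairs: List[Tuple[Line, str]]) -> Recognizer:
    """Toy stand-in for sequence learning: keep the most frequent label per glyph id."""
    votes: Dict[int, Counter] = defaultdict(Counter)
    for line, text in pairs:
        for glyph, char in zip(line, text):
            votes[glyph][char] += 1
    best = {glyph: counts.most_common(1)[0][0] for glyph, counts in votes.items()}
    return lambda line: "".join(best.get(glyph, "?") for glyph in line)


# Hypothetical glyph inventory used only for this simulation.
TRUE_LABELS = {0: "a", 1: "b", 2: "c"}


def noisy_char_ocr(line: Line) -> str:
    """Simulated character OCR that misreads glyph 1 as 'h' at the start of a line."""
    chars = [TRUE_LABELS[glyph] for glyph in line]
    if line and line[0] == 1:
        chars[0] = "h"
    return "".join(chars)


lines = [[1, 0, 2], [0, 1, 2], [2, 1, 0], [0, 0, 1]]
pseudo_gt = make_pseudo_ground_truth(lines, noisy_char_ocr)   # contains one wrong label
model = train_majority_model(pseudo_gt)                        # trained without manual transcription
```

In this toy setting the first line is mislabelled in the pseudo-ground truth, yet the trained model reads it correctly because the correct label dominates across the other lines. A real sequence model additionally exploits the context within each line, which the majority vote does not.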
<br clear=all>
===Printshop-specific character inventories===
[[File:CollageOCR.png|thumbnail|Creation of type tables, using the subproject [[Narragonien]] as an example.| link=http://kallimachos.de/kallimachos/images/kallimachos/0/03/CollageOCR.png | alt=collage of different letter inventories]]
The OCR team at Würzburg University's central library accompanies and evaluates the development process at the DFKI with the help of existing tools stemming from the EMOP project (''Franken+, Gamera, Tesseract''). With our specially developed tool ''Glyph Miner'', specific inventories of letters are compiled for historical printers and publishers and coupled with a digital MUFI font. These inventories allow for the creation of printer-specific training data for OCR, which can then be re-used to capture further texts that use the same sets of letters. With this printshop-specific approach, we are already able to reach recognition rates of 93% and higher, which had not previously been reached on similar types of texts.
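The text does not state how the recognition rate is measured. One common convention, assumed here purely for illustration, is character accuracy: one minus the Levenshtein edit distance between OCR output and reference transcription, divided by the length of the reference. The function names below are hypothetical, and the sample strings are invented rather than project data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed metric: 1 - edit distance / length of the reference transcription."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return 1.0 - levenshtein(ground_truth, ocr_output) / len(ground_truth)


# Invented example: one substituted character in a twelve-character word.
accuracy = character_accuracy("narrenschiff", "narrenfchiff")
```

Under this definition, a rate of 93% corresponds to roughly seven character errors per hundred reference characters.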
<br clear=all>