Project description: Difference between revisions

Current revision as of 11 May 2018, 11:56

Project description

KALLIMACHOS brings together humanities scholars and computer scientists in a regional digital humanities center. Existing collaborations and competencies at Würzburg University are complemented by new partnerships with the DFKI Kaiserslautern (OCR) and the University of Erlangen-Nürnberg (Linguistic Informatics). The establishment of the new hub is funded by the Federal Ministry of Education and Research as part of the e-Humanities funding programme.

Our main focus lies in the supervision and coordination of digital editions and in the application of quantitative analysis through various text mining methods, e.g. stylometric analysis, topic modeling and named entity recognition. We seek to offer our partners the technical and social infrastructure needed to answer a broad range of research questions in the humanities with digital methods.

From a technical point of view, our work includes the development and provision of the required software components and the establishment of prototypical workflows to be incorporated into existing infrastructures. In this regard, we emphasize long-term availability, maintenance and archiving for the projects, portals and data in our care. In this way, KALLIMACHOS strives to build an integrated structure for research data in the humanities.

From a social point of view, we also promote a constant exchange between regional and trans-regional projects in the digital humanities through annual conferences and workshops. By providing advice and training, we introduce experts and newcomers to new and exciting methods and research fields.


Prototypical workflows for editing and data analysis

Starting from our subprojects as use cases, we establish prototypical workflows for data assimilation and analysis in the humanities in a way that is comprehensible to our target audiences. In our workflow management system WüSyphus II, established tools can be linked into processing chains. Through internal and public training courses, these solutions are disseminated to a broad audience in the digital humanities. The resulting best-practice implementations are integrated into the workflow, empirically validated in the context of our use cases and finally made available to the research community. Thus, the established workflows can easily be replicated with similar datasets.

Not every subproject has to pass through all the stations of the WüSyphus II workflow system. If, for example, a new project already has access to high-quality scans of a literary corpus, there is no need to produce these scans a second time. However, establishing an individual project workflow based on previously found solutions is mandatory for every project maintained by KALLIMACHOS. The following graph shows which "links" in our workflow chain are relevant for our current subprojects.
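
The idea that a subproject only passes through the stations it needs can be sketched as follows. This is a minimal illustration and not the WüSyphus II implementation: the station names loosely follow the module headings on this page, and the `make_station`, `build_workflow` and `run_workflow` helpers as well as the example project are hypothetical.

```python
from typing import Callable, Dict, List

# A station takes the project's working data and returns the updated data.
Station = Callable[[dict], dict]


def make_station(name: str) -> Station:
    """Create a placeholder station that only records that it ran."""
    def run(data: dict) -> dict:
        return {**data, "history": data.get("history", []) + [name]}
    return run


# Ordered chain of stations (names and order are assumptions for illustration).
STATIONS: Dict[str, Station] = {
    name: make_station(name)
    for name in ["capture", "metadata", "ocr", "editing",
                 "export", "analysis", "archiving"]
}


def build_workflow(skip: List[str]) -> List[Station]:
    """Return the chain of stations, leaving out those a project does not need."""
    unknown = set(skip) - set(STATIONS)
    if unknown:
        raise ValueError(f"unknown stations: {sorted(unknown)}")
    return [station for name, station in STATIONS.items() if name not in skip]


def run_workflow(workflow: List[Station], data: dict) -> dict:
    """Pass the data through every station of the chain in order."""
    for station in workflow:
        data = station(data)
    return data


# A project that already has high-quality scans skips the capture station.
result = run_workflow(build_workflow(skip=["capture"]), {"project": "example"})
print(result["history"])
```

Keeping the chain as an ordered mapping means that every project still gets an explicit, individual workflow, even when it reuses most of the shared stations.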

Capturing the originals

The Center for Digitization is hosted by Würzburg University's Central Library and provides the technology and trained personnel necessary for new high-quality digitization and re-digitization alike. Even for usually troublesome cases, innovative solutions are at hand: for instance, our specially manufactured book cradle allows for the scanning of books with an opening angle of as little as 60°, thus ensuring the proper conservation of the often highly valuable originals. For large-format posters, an innovative suction wall is available as well.

Metadata editor

The existing metadata editor at Würzburg University's Center for Digitization allows for the centralized maintenance of a wide array of predefined metadata records for manuscripts, incunabula, later printed publications and graphics. For the development of our workflow management system WüSyphus II, extensive optimizations of online performance and the user interface are planned. The upgraded metadata editor will also be able to handle additional categories of metadata, for example those needed to describe historical artifacts and other kinds of realia.

OCR module

Figure: OCR on scans of a Latin Narrenschiff edition. Representation of the OCR process: on the left, the original scan as a grayscale image; on the right, the digital text.

Our OCR module provides an automated preprocessing system for the creation of digital text files. Two working groups, one at the DFKI Kaiserslautern and one at Würzburg University, develop and refine new and existing tools and software components in order to open up texts for which no satisfactory OCR solutions previously existed. The current focal point of these efforts is our use case Narragonien.

anyOCR: a self-learning OCR system

The DFKI coined the term anyOCR for an adaptable OCR method which, in contrast to established OCR systems (i.e. systems based on atomic character segments without more coarse-grained segments like lines or paragraphs), can adapt to different requirements and to the specific problems of OCR for historical documents. Existing segmentation-free OCR methods based on sequence learning have already been applied to handwritten, variously printed and historical documents; they recognize complete lines of text at once and achieve a higher recognition rate than traditional segmentation-based OCR methods. However, to achieve satisfactory results with these methods, a large amount of manually transcribed training material is needed. The generation of this so-called ground truth is time-consuming and expensive. Additionally, synthetically generating the required ground truth is not feasible in the domain of historical documents, as no representative texts are available.

Figure: OCRoRACT-anyOCR training pipeline (training model of the anyOCR workflow).

To deal with the problem of missing ground truth data for sequence learning, the DFKI has developed the framework OCRoRACT, based on the anyOCR method. Here, a conventional character-based OCR method is used to train an initial OCR model on individually recognized symbols. The lines of text it produces, which (in contrast to actual ground truth) may contain errors, are then used to train the sequence learning model in place of manually generated ground truth. By using contextual information, the system is able to learn how to correct the errors in this pseudo-ground truth. An OCRoRACT system trained in this fashion for historical documents has been shown to deliver suitable recognition rates even without the dictionaries that would otherwise be required.
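
The data flow of this bootstrapping step can be outlined schematically. The sketch below is not the DFKI's OCRoRACT code; it only shows how the possibly erroneous output of a character-based recognizer stands in for manual ground truth when training a line-based model. All names (`build_pseudo_ground_truth`, `bootstrap_line_model`, the toy recognizer and the toy trainer) are hypothetical, and text lines are represented as plain strings instead of images.

```python
from typing import Callable, List, Tuple

LineImage = str  # toy stand-in for the image of one text line
Recognizer = Callable[[LineImage], str]
Trainer = Callable[[List[Tuple[LineImage, str]]], Recognizer]


def build_pseudo_ground_truth(
    lines: List[LineImage], char_ocr: Recognizer
) -> List[Tuple[LineImage, str]]:
    """Transcribe every line with the character-based recognizer.

    The transcriptions may contain errors; they are used as training
    targets anyway, in place of manually produced ground truth.
    """
    return [(line, char_ocr(line)) for line in lines]


def bootstrap_line_model(
    lines: List[LineImage], char_ocr: Recognizer, train_line_model: Trainer
) -> Recognizer:
    """Train a line-based model on pseudo-ground truth only."""
    pseudo_gt = build_pseudo_ground_truth(lines, char_ocr)
    return train_line_model(pseudo_gt)


# --- Toy stand-ins, for illustration only ---------------------------------

def toy_char_ocr(line: LineImage) -> str:
    """Pretend recognizer with a systematic error: long s is read as f."""
    return line.replace("ſ", "f")


def toy_trainer(pairs: List[Tuple[LineImage, str]]) -> Recognizer:
    """Pretend trainer that merely memorizes its training pairs.

    A real sequence learning model would generalize and could use
    context to correct some of the errors in the pseudo-ground truth.
    """
    table = dict(pairs)
    return lambda line: table.get(line, "")


line_model = bootstrap_line_model(["das ſchiff"], toy_char_ocr, toy_trainer)
print(line_model("das ſchiff"))
```

The toy trainer reproduces the recognizer's error unchanged, which makes the dependency visible: the quality of the final model hinges on how well the sequence learner can exploit context to overcome errors in its training targets.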

Printshop-specific character inventories

Figure: Creation of type tables, using the Narragonien subproject as an example (collage of different letter inventories).

The OCR team at Würzburg University's Central Library accompanies and evaluates the development process at the DFKI with the help of existing tools stemming from the eMOP project (Franken+, Gamera, Tesseract). Using our specially developed tool Glyph Miner, specific inventories of letters are compiled for historical printers and publishers and linked to a digital MUFI font. These inventories allow for the creation of printer-specific training data for OCR, which can then be reused to capture further texts printed with the same sets of letters. With this printshop-specific approach, we already reach recognition rates of 93% and higher, which have not been achieved on similar types of texts before.
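
This page does not state how the recognition rate is measured. A common convention, assumed here purely for illustration, is character accuracy: one minus the edit distance between OCR output and ground truth, divided by the length of the ground truth. The function names and the sample strings below are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in [0, 1], relative to the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


# One wrong character in a 16-character line.
accuracy = character_accuracy("Das Narrenschiff", "Das Narrenfchiff")
print(f"{accuracy:.2%}")  # prints 93.75%
```

Under this convention, a rate of 93% corresponds to roughly seven character errors per hundred characters of ground truth.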

Synoptic editing tools

Figure: The synoptic transcription editor for the simplified editing of OCR output, used for the use case Narragonien.

This module establishes a framework for online editing tools that enables users to view texts and images side by side for annotation and text-image linking. These editors can be tailored to suit different project-specific requirements. The resulting intuitive web-based editing tools can be used without deeper insight into XML and other contemporary text encoding formats, which has proven especially useful for the manual correction of OCR output. In concert with the user management system and the editorial infrastructure provided by the WüSyphus II workflow system, this allows for the organized inclusion of research assistants, students and even interested “laymen” in the editing process.

Data exports and online presentation

The annotated texts, images and additional data types will be transferable into various established export and interchange formats, depending on the individual project requirements. For instance, data exchange with the TextGrid Repository can be enabled through XML encoding conforming to the TEI standards. Besides these export options, individual solutions for presenting data will be available for each project's web portal. In particular, the aforementioned framework for synoptic editors will also be reusable for building a synoptic viewer suitable for presenting references between images and texts. For instance, the scans used for the edition, the initial OCR text, manual transcriptions, localizations, annotations and metadata can be viewed, hidden or highlighted simultaneously.
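
As an illustration of what a TEI-conformant export might look like at its simplest, the following sketch builds a minimal TEI document with Python's standard library. It is not the project's export routine: the `build_tei` helper, the placeholder header texts and the sample content are assumptions, and a real export would carry far richer metadata and markup.

```python
import xml.etree.ElementTree as ET
from typing import List

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements in the default namespace (no prefix).
ET.register_namespace("", TEI_NS)


def tei(tag: str) -> str:
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title: str, paragraphs: List[str]) -> str:
    """Build a minimal TEI document: a header plus one <p> per paragraph."""
    root = ET.Element(tei("TEI"))

    # Header with the three required parts of <fileDesc>.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Unpublished example"
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "Transcribed from digitized scans"

    # Transcribed text.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph

    return ET.tostring(root, encoding="unicode")


document = build_tei("Example edition", ["First paragraph.", "Second paragraph."])
print(document)
```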

Semantic MediaWiki

Based on Semantic MediaWiki (SMW), an open-source extension of the MediaWiki system (best known as the software behind Wikipedia and many other wiki systems), an easy-to-use and quickly adaptable Web 3.0 component can be provided for the processing, structuring and presentation of various datasets. Thanks to MediaWiki's user management system and the automated versioning of changes, SMW is especially well suited to integrating crowdsourcing into a project's workflow. For the transfer of data from the wiki environment to WüSyphus II, new interfaces and import routines are to be developed. For less demanding projects, SMW can also be used directly as a means of presenting data. The search and query options already built into SMW are especially convenient for implementing primarily database-driven projects like academic source catalogs.
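
To give an impression of SMW's query options, the sketch below assembles a request URL for the `ask` module of the MediaWiki API that SMW provides. It only builds the URL and sends nothing. The wiki address, the category `Source` and the properties `Author` and `Date` are hypothetical examples and do not describe the actual KALLIMACHOS wikis.

```python
from typing import List
from urllib.parse import urlencode


def build_ask_url(api_url: str, conditions: str,
                  printouts: List[str], limit: int = 50) -> str:
    """Build a URL for SMW's `ask` API module.

    conditions: query conditions in SMW syntax, e.g. "[[Category:Source]]"
    printouts:  property names whose values should be returned
    """
    query = conditions
    query += "".join(f"|?{name}" for name in printouts)
    query += f"|limit={limit}"
    params = {"action": "ask", "query": query, "format": "json"}
    return f"{api_url}?{urlencode(params)}"


# Hypothetical source catalog: all pages in the category "Source",
# together with their "Author" and "Date" properties.
url = build_ask_url(
    "https://wiki.example.org/api.php",
    "[[Category:Source]]",
    ["Author", "Date"],
    limit=10,
)
print(url)
```

An import routine of the kind mentioned above could fetch such a URL and convert the JSON response into the target system's own records.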

Textual analysis interface

Building upon our textual analysis use cases, this module supports:

The aggregation of a corpus of texts to be analyzed from the TextGrid Repository and WüSyphus II based on their metadata,

The preparation of the chosen texts and their metadata for analysis,

The analysis in UIMA and finally

The incorporation of the results in TextGrid by reassigning UIMA annotations to TEI.

These steps can be customized and generalized to be reusable in future projects. In the long term, even novices and “laymen” in the field of data analysis will be able to profit from automated analysis methods, which can, for instance, be used to recognize grammatical cases and structures or keywords in a text. As a data transfer format between the textual analysis and the WüSyphus II workflow system, the CoNLL format is proposed.
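
CoNLL-style formats store one token per line with tab-separated columns and separate sentences by blank lines. The page does not say which CoNLL variant or which columns are intended, so the sketch below assumes a simplified four-column layout (ID, FORM, LEMMA, POS); the helper names and the sample sentence are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (form, lemma, part-of-speech tag)


def to_conll(sentences: List[List[Token]]) -> str:
    """Serialize sentences as tab-separated rows, one token per line."""
    blocks = []
    for sentence in sentences:
        rows = [
            f"{index}\t{form}\t{lemma}\t{pos}"
            for index, (form, lemma, pos) in enumerate(sentence, start=1)
        ]
        blocks.append("\n".join(rows))
    # Sentences are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


def from_conll(text: str) -> List[List[Token]]:
    """Parse the simplified four-column layout back into sentences."""
    if not text.strip():
        return []
    sentences = []
    for block in text.strip().split("\n\n"):
        sentence = []
        for row in block.splitlines():
            _index, form, lemma, pos = row.split("\t")
            sentence.append((form, lemma, pos))
        sentences.append(sentence)
    return sentences


sample = [[("Ships", "ship", "NOUN"), ("sail", "sail", "VERB"), (".", ".", "PUNCT")]]
serialized = to_conll(sample)
print(serialized)
```

Because the format is plain text with a fixed column order, annotations produced by the analysis can be written out and read back without loss, which is what makes it attractive as an interchange format between independent components.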

Versioning and archiving

A crucial and often neglected factor in the success of digital projects, not only in the humanities, is a reliable guarantee of long-term reproducibility and reusability of the underlying data. For “living” data collections and corpora, i.e. those that are continuously maintained and expanded, ensuring data security is of primary concern. To ensure proper versioning, Git-based systems are envisioned in addition to our wiki systems. Alongside the stable availability and versioning of datasets, methods for long-term archiving are to be implemented as well.


Coordination

Zentrum für Philologie und Digitalität

Am Hubland

D-97074 Würzburg

Phone: 0931/31-80534

E-mail

Prof. Dr. Frank Puppe (Project manager)

Dr. Herbert Baier-Saip (System development and administration)

Jonathan Gaede (Wiki systems and subproject communication)

Partners at Würzburg University

Textmining

Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte

Am Hubland, Bau 8

D-97074 Würzburg

Tel.: 0931-31 88421

E-Mail

Prof. Dr. Fotis Jannidis

Lehrstuhl für Künstliche Intelligenz und Angewandte Informatik (Informatik VI)

Arbeitsgruppe Data Mining und Information Retrieval

Am Hubland

D-97074 Würzburg

Tel.: 0931-31 86731

Prof. Dr. Frank Puppe
Prof. Dr. Andreas Hotho
Dipl.-Math. Lena Hettinger

Segmentation and OCR

Lehrstuhl für Algorithmen, Komplexität und wissensbasierte Systeme (Informatik I)

Am Hubland

D-97074 Würzburg

Tel.: 0931-31 850541

Benedikt Budig, M.Sc.
Dr. Thomas van Dijk

Lehrstuhl für Künstliche Intelligenz und Angewandte Informatik (Informatik VI)

Am Hubland

D-97074 Würzburg

Tel.: 0931-31 86731

Christian Reul, M.Sc.

Project group Narragonien digital

Neuphilologisches Institut / Romanistik

Lehrstuhl für Französische und Italienische Literaturwissenschaft

Am Hubland, Bau 5

D-97074 Würzburg

Tel.: 0931 31-85681

Prof. Dr. Brigitte Burrichter

Viktoria Walter

Lehrstuhl für deutsche Philologie, Ältere Abteilung

Professur für deutsche Philologie, insb. Literaturgeschichte des späten Mittelalters und der frühen Neuzeit

Am Hubland, Bau 4

D-97074 Würzburg

Tel.: 0931 31-81679

Prof. Dr. Joachim Hamm

Christine Grundig M.A.

Project group Anagnosis

Institut für Klassische Philologie

Lehrstuhl I (Gräzistik)

Residenzplatz, 2 (Südflügel)

D-97070 Würzburg

Prof. Dr. Dr. h.c. Michael Erler

AR Dr. Holger Essler

Vincenzo Damiani, M.A.

Project group Schulwandbilder digital

Lehrstuhl für Systematische Bildungswissenschaft

Forschungsstelle Historische Bildmedien

Campus Hubland Nord

Oswald-Külpe-Weg 86

D-97074 Würzburg

Tel.: 0931 31 89672

E-mail

Univ.-Prof. Dr. phil. habil. Andreas Dörpinghaus (Chair holder)
Dr. phil. Ina Uphoff (Project director)
Dipl. Päd. Eva Zimmer, M.A. (Vice project director)

Project group Identifikation von Übersetzern

Institut für Philosophie

Philosophie- und Wissenschaftsgeschichte der griechisch-arabisch-lateinischen Tradition

Residenz - Südflügel

D-97070 Würzburg

Tel. 0931 31 2778

Prof. Dr. Dag Nikolaus Hasse

Andreas Büttner, B.A.

Jonathan Maier

Project group Romangattungen

Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte

Am Hubland, Bau 8

D-97074 Würzburg

Tel.: 0931-31 88421

E-Mail

Prof. Dr. Fotis Jannidis

Isabella Reger

Lehrstuhl für Künstliche Intelligenz und Angewandte Informatik (Informatik VI)

Arbeitsgruppe Data Mining und Information Retrieval

Am Hubland

D-97074 Würzburg

Tel.: 0931-31 86731

Dipl.-Math. Lena Hettinger

Project group Romanfiguren

Lehrstuhl für Computerphilologie und Neuere Deutsche Literaturgeschichte

Am Hubland, Bau 8

D-97074 Würzburg

Tel.: 0931-31 88421

E-Mail

Prof. Dr. Fotis Jannidis

Isabella Reger

Lehrstuhl für Künstliche Intelligenz und Angewandte Informatik (Informatik VI)

Arbeitsgruppe Data Mining und Information Retrieval

Am Hubland

D-97074 Würzburg

Tel.: 0931-31 86731

Prof. Dr. Frank Puppe

Markus Krug, M.Sc.

External Partners

Professur für Korpuslinguistik (FAU Erlangen-Nürnberg)

Bismarckstr. 6

91054 Erlangen

Tel.: +49 09131 85-29251

E-mail

Prof. Dr. Stefan Evert

Thomas Proisl, M.A.

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI) GmbH

Forschungsgruppe Wissensmanagement

Trippstadter Straße 122

67663 Kaiserslautern

Tel.: 0631 20575-1000

E-Mail

Prof. Dr. Andreas Dengel

Dr. Syed Saqib Bukhari

@@ Line 4: / Line 4: @@
 ==Project description==
-KALLIMACHOS merges humanists, computer scientists and librarians in a regional digital humanities center. The already existing cooperations and competencies at Würzburg University are complemented by new partnerships with the [http://www.dfki.de/web/kontakt/dfki-kaiserslautern DFKI Kaiserslautern] (OCR) and the [http://www.linguistik.uni-erlangen.de/index.shtml University of Erlangen-Nürnberg] (Linguistic Informatics). The constution of the new hub is sponsored by the Federal Ministry of Education and Research as a part of the e-Humanities funding programme.
+KALLIMACHOS merges humanists and computer scientists in a regional digital humanities center. Already existing cooperations and competencies at Würzburg University are complemented by new partnerships with the [http://www.dfki.de/web/kontakt/dfki-kaiserslautern DFKI Kaiserslautern] (OCR) and the [http://www.linguistik.uni-erlangen.de/index.shtml University of Erlangen-Nürnberg] (Linguistic Informatics). The constution of the new hub is sponsored by the Federal Ministry of Education and Research as a part of the e-Humanities funding programme.
-Our main point of interest lies in the supervision and coordination of digital editions and the application of quantitative analysis via different methods of text mining, i.e. stilometric analysis, topic modeling and named entity ''recognition''. We seek to offer our partners the technical and social infrastructure needed to answer a broad palette of research questions in the humanities based on digital methods.
+Our main point of interest lies in the supervision and coordination of digital editions and the application of quantitative analysis via different methods of text mining, i.e. ''stilometric analysis, topic modeling'' and ''named entity recognition''. We seek to offer our partners the technical and social infrastructure needed to answer a broad palette of research questions in the humanities based on digital methods.
 From a technical point of view, our work includes the development and provision of the required software components and the establishment of prototypical workflows to be incorporated in existing infrastructures. In this regard, we emphasize long-term availability, maintenance and archiving for the projects, portals and data in our care. Insofar, KALLIMACHOS strives to build an integrated structure for research data in the humanities.
@@ Line 12: / Line 12: @@
 From a more social point of view, we also promote a constant exchange between regional and trans-regional projects in the digital humanities through annual conferences and workshops. By providing advice and training, we introduce experts and newcomers into new and exciting methods and research fields.
-=Workplan=
+=Work plan=
 ==Prototypical workflows for editioning and data analysis==
-Starting from our subprojects as ''Use'' Cases, we establish prototypical workflows for data assimilation and analysis in the humanities in a way comprehensible for our target audience. In our workflow management system ''Wüsyphus II'', established tools can be lined into work chains. Through internal and public training courses, these solutions are propagated among a broad audience in the context of Digital Humanities. The newfound best-practice implementations are integrated into the workflow, empirically validated in the context of our Use ''Cases'' und finally provided to the research community. Thus, the established workflows can be easily replicated with similar datasets.
+Starting from our subprojects as ''Use'' Cases, we establish prototypical workflows for data assimilation and analysis in the humanities in a way comprehensible for our target audiences. In our workflow management system ''Wüsyphus II'', established tools can be lined into work chains. Through internal and public training courses, these solutions are propagated among a broad audience in the context of Digital Humanities. The newfound best-practice implementations are integrated into the workflow, empirically validated in the context of our Use ''Cases'' und finally provided to the research community. Thus, the established workflows can be easily replicated with similar datasets.
-Not every Subproject has to be handed down through all the stations of the ''Wüsyphus II'' workflow system. If, for example, a new project already has access to high-quality Scans of a certain literary corpus, there´s no need to reproduce these Scans another time. However, the establishment of an individual project workflow based on previous solutions is mandatory for every project maintained by KALLIMACHOS. The following graph shows which "chain links" are significant for our current Subprojects.
+Not every Subproject has to be handed down through all the stations of the ''Wüsyphus II'' workflow system. If, for example, a new project already has access to high-quality Scans of a literary corpus, there´s no need to reproduce these Scans for a second time. However, the establishment of an individual project workflow based on previously found solutions is mandatory for every project maintained by KALLIMACHOS. The following graph shows which "links" in our workflow chain are significant for our current Subprojects.
-[[File:WFUC.png | link= |alt=participation of the use Cases Narragonien digital, Anagnosis, Schulwandbilder digital,
+[[File:WFUC_en.png | link= |alt=participation of the use Cases Narragonien digital, Anagnosis, Schulwandbilder digital,
 Narrative Techniken und Untergattungen, Leserlenkung in Bezug auf Figuren and Identifizierung Anonymer Übersetzer in our workflow system.]]
 ==Capturing the originals ==
-The Center for digitalisation is hosted by Würzburg University´s Central Library and provides the technology and the trained staff necessary for new high-quality digitalisations and Re-digitalisation alike. Even for usually troublesome cases, innovative solutions are at hand: For instance, our specially manufactured book rocker allows for the scanning of books that only offer an opening angle of up to 60° or more, thus ensuring the proper conservation of the often highly valuable originals. For large size posters, an innovative suction wall is available as well.
+The Center for digitalisation is hosted by Würzburg University´s Central Library and provides the technology and the trained personell necessary for new high-quality digitalisations and re-digitalisation alike. Even for usually troublesome cases, innovative solutions are at hand: For instance, our specially manufactured book rocker allows for the scanning of books that only offer an opening angle of up to 60° or more, thus ensuring the proper conservation of the often highly valuable originals. For large size posters, an innovative suction wall is available as well.
 <!--[[File: Ulf am Scanner.jpg | 200px]]
@@ Line 39: / Line 39: @@
 ==Metadata editor==
 [[File:MetaEditor.png | thumbnail | The user interface of our current metadata editor | link=http://kallimachos.de/kallimachos/images/kallimachos/1/11/MetaEditor.png | alt= The user interface of our current metadata editor]]
-The already existing metadata editor at the center for digitalisiation allows for the centralized maintenance of a wide array of predefined metadata records for manuscripts, incunabula, print publications and graphics. For the development of our workflow management system ''WüSyphus II'', extensive optimizations of the online performance and the user interface are planned. The upgraded metadata editor will also be able to handle additional categories of metadata, for example those needed to describe historic artifacts and other kinds of realia.
+The already existing metadata editor at Würzburg University´s center for digitalisiation allows for the centralized maintenance of a wide array of predefined metadata records for manuscripts, incunabula, newer printed publications and graphics. For the development of our workflow management system ''WüSyphus II'', extensive optimizations of the online performance and the user interface are planned. The upgraded metadata editor will also be able to handle additional categories of metadata, for example those needed to describe historic artifacts and other kinds of realia.
 <br clear=all>
 ==OCR module==
 [[File:NarragonienOCR.png|thumbnail|OCR on Scans of a latin [[Narragonien|Narrenschiff]] edition.|link=http://kallimachos.de/kallimachos/images/kallimachos/d/d0/NarragonienOCR.png|alt=representation of the OCR process. On the left: The original scans as a grayscale image. On the left: The digital Text.]]
-Our OCR module provides an automated preprocessing system for the creation of digital full text. Two working groups, one at the [https://www.dfki.de/web DFKI Kaiserslautern] and one at Würzburg University, seek to develop and cultivate new and existing tools and software components that are able to tap into texts that previously weren´t  suitable for satisfying OCR solutions. The current focal point of these efforts is our Use Case [[Narragonien:Main | Narragonien]].
+Our OCR module provides an automated preprocessing system for the creation of digital text files. Two working groups, one at the [https://www.dfki.de/web DFKI Kaiserslautern] and one at Würzburg University, seek to develop and cultivate new and existing tools and software components that are able to tap into texts that previously weren´t  suitable for satisfying OCR solutions. The current focal point of these efforts is our ''Use Case'' [[Narragonien:Main | Narragonien]].
 <br clear=all>
 ===''anyOCR'': a self-learning OCR system===
-The DFKI established the term ''anyOCR'' for an adaptable optical OCR method, which – in contrast to established OCR-Systems (i.e. systems based on atomic character segments without more coarse-grained segments like lines or paragraphs) – can adapt to different requirements and the specific problems of OCR for historical documents. Traditional segmentation-free OCR methods based in sequence learning could already be utilized for handwritten, diversly printed and historical documents and were able recognize complete lines of text at once and with a higher recognition rate than traditional segmentation-based OCR methods. However, to achieve satisfying results with these methods, a lot of manually transcribed training material is needed. The generation of this so called ground truth is time-consuming and expensive. Additionaly, the option of a synthetic generation of ground truth is not feasible in the domain of historical documents, as no representative text are available.
+The DFKI established the term ''anyOCR'' for an adaptable optical OCR method, which – in contrast to established OCR systems (i.e. systems based on atomic character segments without more coarse-grained segments like lines or paragraphs) – can adapt to different requirements and the specific problems of OCR for historical documents. Traditional segmentation-free OCR methods based on sequence learning could already be utilized for handwritten, diversly printed and historical documents and were able to recognize complete lines of text at once and with a higher recognition rate than traditional segmentation-based OCR methods. However, to achieve satisfying results with these methods, a lot of manually transcribed training material is needed. The generation of this so called ''ground truth'' is time-consuming and expensive. Additionaly, the option of  synthetically generating the required ground truth is not feasible in the domain of historical documents, as no representative text are available.
 <br clear=all>
 [[File:anyOCRtPipeline.png|600px|center|OCRoRACT-anyOCR Training Pipeline|link=|alt=training model of the anyOCR workflow]]
 <br clear=all>
-To deal with the problem of missing ground truth data for sequence learning, the DFKI has developed the framework ''OCRoRACT'' based on the ''anyOCR''-method. Here, a conventional character-based OCR method is deployed to train an initial OCR model using individually recognized symbols. The resulting lines of Text, which may be (in contrast to an actual ground truth) flawed by errors, are then used to train the sequence learning model instead of the manually generated ground truth. By using contextual information, the system is able to learn how to correct the errors in this pseudo- ground truth. An OCRoRACT-System trained in this fashion for historical documents has proven to be able to deliver suitable recognition rates despite the imposed lack of the required dictionaries.
+To deal with the problem of missing ground truth data for sequence learning, the DFKI has developed the framework ''OCRoRACT'' based on the ''anyOCR''-method. Here, a conventional character-based OCR method is deployed to train an initial OCR model using individually recognized symbols. The resulting lines of text, which (in contrast to an actual ''ground truth'') may  be flawed by errors, are then used to train the sequence learning model instead of the manually generated ground truth. By using contextual information, the system is able to learn how to correct the errors in this pseudo-ground truth. An ''OCRoRACT''-System trained in this fashion for historical documents has proven to be able to deliver suitable recognition rates despite the imposed lack of the required dictionaries.
 <br clear=all>
 <br clear=all>
-===Printer-specific character inventories===
+===Printshop-specific character inventories===
-[[File:CollageOCR.png|thumbnail|Erstellung von Typentabellen am Beispiel des Teilprojekts [[Narragonien]].| link=http://kallimachos.de/kallimachos/images/kallimachos/0/03/CollageOCR.png | alt=Collage verschiedener Typentabellen]]
+[[File:CollageOCR.png|thumbnail|Erstellung von Typentabellen am Beispiel des Teilprojekts [[Narragonien]].| link=http://kallimachos.de/kallimachos/images/kallimachos/0/03/CollageOCR.png | alt=collage of different letter inventories]]
-The OCR Team at the Würzburg University library accompanies and evaluates the development process at the DFKI with the help of existing tools stemming from the EMOP project (''Franken+, Gamera, Tesseract''). With the help of our specially developed tool ''Glyph Miner'', specific inventories of letters are compiled for historic printers and publishers and coupled with a digital MUFI font type. These inventories allow the creation of printer-specific training data for OCR, which can be re-used to capture other texts using the same sets of letters. With this printer-specific approach, we are already able to reach recognition rates of 93% and higher, which has not been reached on similar types of texts before.
+The OCR-Team at Würzburg University´s central library accompanies and evaluates the development process at the DFKI with the help of existing tools stemming from the EMOP project (''Franken+, Gamera, Tesseract''). With the help of our specially developed tool ''Glyph Miner'', specific inventories of letters are compiled for historic printers and publishers and coupled with a digital MUFI font type. These inventories allow for the creation of printer-specific training data for OCR, which can then be re-used to capture further texts using the same sets of letters. With this printshop-specific approach, we are already able to reach recognition rates of 93% and higher, which has not been reached on similar types of texts before.
 <br clear=all>
 ==Synoptic editing tools==
-[[File:NarragonienTransEditor.png | thumbnail | The synoptic editor for the simplified edition of OCR transcription used for the ''use case'' [[Narragonien]]. | link=http://kallimachos.de/kallimachos/images/kallimachos/e/ed/NarragonienTransEditor.png | alt=a view on the synoptic transcription editor.]]
+[[File:NarragonienTransEditor.png | thumbnail | The synoptic editor for the simplified edition of OCR output used for the ''use case'' [[Narragonien]]. | link=http://kallimachos.de/kallimachos/images/kallimachos/e/ed/NarragonienTransEditor.png | alt=The synoptic transcription editor.]]
-This module establishes a framework for online editing tools that enables the users to view texts and images side by side for annotation and text-image linking. These editors can be tailored to suit different project specific requirements. The resulting intuitively usable web-based edition tools can be used without a need for deeper insight into XML and other contemporary text encoding formats, which has been proven to be especially useful for the manual correction of OCR output. In concert with the user management system und the editorial infrastructure provided by the ''WÜsyphus II'' workflow system, this allows for the organized inclusion of research assistants, students and even interested “laymen” in the editing process.
+This module establishes a framework for online editing tools that enables users to view texts and images side by side for annotation and text-image linking. These editors can be tailored to suit different project-specific requirements. The resulting intuitive web-based editing tools can be used without deeper insight into XML and other contemporary text encoding formats, which has proven especially useful for the manual correction of OCR output. In concert with the user management system and the editorial infrastructure provided by the ''WÜsyphus II'' workflow system, this allows for the organized inclusion of research assistants, students and even interested “laymen” in the editing process.
 <br clear=all>
 ==Data exports and online presentation==
-The annotated texts, images und additional datatypes will be transferrable into various established export and interchange formats, depending on the individual project requirements. For instance, the data exchange with the [https://textgrid.de/ TextGrid]-Repository can be enabled through XML encoding conforming to the TEI standards. Beneath these export options, individual solutions for presenting data will be available for the project´s web portal. Especially the preceded framework for synoptic editors will also be reusable for building a synopic viewer for presentation of references between images and texts. For instance, the Scans used for the edition, the initial OCR text, manual transcriptions, localisations, annotations and metadata can be viewed or hidden simultaneously in this system.
+The annotated texts, images and additional data types will be transferable into various established export and interchange formats, depending on the individual project requirements. For instance, data exchange with the [https://textgrid.de/ TextGrid] repository can be enabled through XML encoding conforming to the TEI standards. Besides these export options, individual solutions for presenting data will be available for the project's web portal. In particular, the aforementioned framework for synoptic editors will also be reusable for building a synoptic viewer suitable for presenting references between images and texts. For instance, the scans used for the edition, the initial OCR text, manual transcriptions, localisations, annotations and metadata can be viewed, hidden or highlighted simultaneously.
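As an illustration of what a TEI-conformant XML export might look like, the sketch below builds a minimal TEI skeleton with Python's standard library. It is not the project's export routine. The function name, the placeholder texts for publication and source, and the flat list of paragraphs are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title: str, paragraphs: list[str]) -> str:
    """Serialize a title and a list of paragraphs as a minimal TEI document."""
    ET.register_namespace("", TEI_NS)  # emit TEI as the default namespace

    def tag(name: str) -> str:
        return f"{{{TEI_NS}}}{name}"

    tei = ET.Element(tag("TEI"))

    # A TEI header needs a fileDesc with titleStmt, publicationStmt and sourceDesc.
    file_desc = ET.SubElement(ET.SubElement(tei, tag("teiHeader")), tag("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tag("titleStmt"))
    ET.SubElement(title_stmt, tag("title")).text = title
    publication = ET.SubElement(file_desc, tag("publicationStmt"))
    ET.SubElement(publication, tag("p")).text = "Unpublished example export"  # placeholder
    source = ET.SubElement(file_desc, tag("sourceDesc"))
    ET.SubElement(source, tag("p")).text = "Transcribed from digitized scans"  # placeholder

    # The transcription itself goes into text/body as simple paragraphs.
    body = ET.SubElement(ET.SubElement(tei, tag("text")), tag("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tag("p")).text = paragraph

    return ET.tostring(tei, encoding="unicode")
```

A real export would additionally carry annotations, image references and metadata. Which TEI elements are used for these is a project-specific decision that the text leaves open.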
 ==Semantic MediaWiki==
 [[File: SMW.png | right | 150px|link= | alt=Logo of Semantic MediaWiki]]
-Based of [https://www.semantic-mediawiki.org/wiki/Semantic_MediaWiki Semantic MediaWiki], an open-source expansion of the MediaWiki system (best known as the scaffolding for Wikipedia and many other wiki systems), an easily usable and quickly adaptable web 3.0 component can be provided for the processing, structuring and presentation of various datasets. Through MediaWikis user management system and the automated versioning of changes, SMW is suited especially well for the implementation of crowdsourcing into the project workflow. For the transfer of data from the wiki environment WüySyphus II, interfaces and import routines are to be developed. For less challenging projects, SMW can also be used directly as a means of presenting data. The options for searches and queries already incorporated in SWM are especially convenient for the implementation of primarily databank-driven projects.
+Based on [https://www.semantic-mediawiki.org/wiki/Semantic_MediaWiki Semantic MediaWiki], an open-source extension of the MediaWiki system (best known as the foundation of Wikipedia and many other wiki systems), an easily usable and quickly adaptable web 3.0 component can be provided for the processing, structuring and presentation of various datasets. Through MediaWiki's user management system and the automated versioning of changes, SMW is especially well suited for integrating ''crowdsourcing'' into a project's workflow. For the transfer of data from the wiki environment to ''WüSyphus II'', new interfaces and import routines are to be developed. For less challenging projects, SMW can also be used directly as a means of presenting data. The options for searches and queries already incorporated in SMW are especially convenient for the implementation of primarily database-driven projects like academic source catalogs.
 <br clear=all>
@@ Line 84: / Line 84: @@
-These steps can be customized and generalized to be reusable by future projects. Perspectively, even novices and “laymen” in the field of data analysis will be able to profit from automated analysis methods, which can, for instance, be used for the recognition of grammatic cases and structures or named Entities in a text. As a data transfer format between the textual textual analysis and the WüSyphus II workflow system, the CoNLL format is proposed.
+These steps can be customized and generalized to be reusable in future projects. In the long run, even novices and “laymen” in the field of data analysis will be able to profit from automated analysis methods, which can, for instance, be used for the recognition of grammatical cases and structures or keywords in a text. As a data transfer format between the textual analysis and the ''WüSyphus II'' workflow system, the ''CoNLL'' format is proposed.
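To make the proposed transfer format more concrete, here is a minimal sketch of a reader for CoNLL-style data. It assumes the common conventions of the CoNLL family: one token per line, tab-separated columns, blank lines between sentences, and optional comment lines starting with <code>#</code>. Which CoNLL variant and which columns the project uses is not stated, so the function name and the four-column sample are illustrative assumptions.

```python
def parse_conll(text: str) -> list[list[list[str]]]:
    """Split CoNLL-style text into sentences, each a list of token rows (column lists)."""
    sentences: list[list[list[str]]] = []
    current: list[list[str]] = []
    for raw_line in text.splitlines():
        line = raw_line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment or metadata lines
        current.append(line.split("\t"))
    if current:
        sentences.append(current)  # last sentence without trailing blank line
    return sentences


# Made-up sample with the assumed columns: ID, form, lemma, part of speech.
SAMPLE = (
    "# sentence 1\n"
    "1\tDas\tder\tDET\n"
    "2\tSchiff\tSchiff\tNOUN\n"
    "\n"
    "1\tNarren\tNarr\tNOUN\n"
)

if __name__ == "__main__":
    for sentence in parse_conll(SAMPLE):
        print([row[1] for row in sentence])  # print the word forms per sentence
```

A plain column-based format of this kind keeps the analysis tools and the workflow system loosely coupled, since either side only needs to read and write tab-separated text.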
 <br clear=all>
 ==Versioning and archiving==
-A crucial and often neglected factor for the success of digital projects not only in the humanities is the conclusive guarantee for long-term reproducibility and reusability of the underlying data. For “living”, i.e. for continuously maintained and expanded data collections and (partial) corpora, ensuring data security is of primary concern. To ensure proper versioning, Git-based Systems are envisioned in addition to our wiki systems. Alongside the stable availability and versioning of datasets, methods or long term archiving are to be implemented as well.
+A crucial and often neglected factor in the success of digital projects, not only in the humanities, is a reliable guarantee of long-term reproducibility and reusability of the underlying data. For “living” data collections and corpora, i.e. those that are continuously maintained and expanded, ensuring data security is of primary concern. To ensure proper versioning, Git-based systems are envisioned in addition to our wiki systems. Alongside the stable availability and versioning of datasets, methods for long-term archiving are to be implemented as well.
 =Contact=
 ==Coordination==
 {{Adresse Kallimachos}}
-*Dr. [https://elmut.uni-wuerzburg.de/person/23791 Hans-Günter Schmidt] (Project director)
+<!--*Dr. [https://elmut.uni-wuerzburg.de/person/23791 Hans-Günter Schmidt] (Library manager)
+*Dr. [https://wueaddress.uni-wuerzburg.de/person/84041 Uwe Springmann] (Project director)
 *[https://elmut.uni-wuerzburg.de/person/4730 Kerstin Kornhoff] (Acquisition)
@@ Line 103: / Line 105: @@
 *[https://elmut.uni-wuerzburg.de/person/29458 Almut Wenk] (Acquisition)
-*[https://elmut.uni-wuerzburg.de/person/351 Tanja Altenhöfer] (Acquisition)
+*[https://elmut.uni-wuerzburg.de/person/351 Tanja Altenhöfer] (Acquisition)-->
-*[https://elmut.uni-wuerzburg.de/person/7302 Jonathan Gaede] (Wki systems and subproject communication)
+*Prof. Dr. [http://www.is.informatik.uni-wuerzburg.de/en/staff/puppe_frank/ Frank Puppe] (Project manager)
 *Dr. [https://elmut.uni-wuerzburg.de/person/916 Herbert Baier-Saip] (System development and administration)
+*[https://elmut.uni-wuerzburg.de/person/7302 Jonathan Gaede] (Wiki systems and subproject communication)
+<!--
 *Dipl.-Inform. [https://elmut.uni-wuerzburg.de/person/13342 Felix Kirchner] (System development and OCR)
@@ Line 120: / Line 125: @@
 *[https://elmut.uni-wuerzburg.de/person/8294 Irmgard Götz-Kenner] (Image editing and photography)
+-->
 <br clear=all>
 ----
@@ Line 127: / Line 132: @@
 ----
 ===Textmining===
+<br clear=all>
 {{Lehrstuhl Comphil}}
 *Prof. Dr. [http://www.jannidis.de/ Fotis Jannidis]
@@ Line 132: / Line 138: @@
 {{LSKI}}
 *Prof. Dr. [http://www.is.informatik.uni-wuerzburg.de/en/staff/puppe_frank/ Frank Puppe]
 * Prof. Dr. [http://www.is.informatik.uni-wuerzburg.de/staff/hotho Andreas Hotho]
@@ Line 139: / Line 144: @@
 ----
+===Segmentation and OCR===
+<br clear=all>
+{{AKS}}
+*[http://www1.informatik.uni-wuerzburg.de/mitarbeiterinnen/budig_benedikt/ Benedikt Budig], M.Sc.
+*Dr. [http://www1.informatik.uni-wuerzburg.de/mitarbeiterinnen/dijk_thomas_van/ Thomas van Dijk]
+<br clear=all>
+{{LSKI NoDMIR}}
+*[http://www.is.informatik.uni-wuerzburg.de/staff/reul_christian/ Christian Reul], M.Sc.
+<br clear=all>
+----
 ===Project group ''Narragonien digital''===
 <br clear=all>
@@ Line 168: / Line 184: @@
 <br clear=all>
 {{Adresse Schulwandbilder}}
-*Univ.-Prof. Dr. phil. habil. [http://www.bildungswissenschaft.uni-wuerzburg.de/team/lehrstuhlinhaber/ Andreas Dörpinghaus] (Lehrstuhlinhaber)
+*Univ.-Prof. Dr. phil. habil. [http://www.bildungswissenschaft.uni-wuerzburg.de/team/lehrstuhlinhaber/ Andreas Dörpinghaus] (Chair holder)
-*Dr. phil. [http://www.bildungswissenschaft.uni-wuerzburg.de/forschungsstelle_historische_bildmedien/team/leitung/ Ina Uphoff] (Projektleiterin)
+*Dr. phil. [http://www.bildungswissenschaft.uni-wuerzburg.de/forschungsstelle_historische_bildmedien/team/leitung/ Ina Uphoff] (Project director)
-*Dipl. Päd. [http://www.bildungswissenschaft.uni-wuerzburg.de/forschungsstelle_historische_bildmedien/team/stellv_leitung/  Eva Zimmer], M.A. (stellv. Projektleiterin)
+*Dipl. Päd. [http://www.bildungswissenschaft.uni-wuerzburg.de/forschungsstelle_historische_bildmedien/team/stellv_leitung/  Eva Zimmer], M.A. (Vice project director)
 <br clear=all>
 ----