Tutorial anonymus translators (en): Difference between revisions
From Kallimachos
This process can be digitally enhanced by asking digital Latin reference lexica whether they recognize the words in the texts of the corpus. The easiest approach is a comparison with a Latin word list (e.g. '''[https://github.com/cisocrgroup/Resources/tree/master/lexica/latin here]''' or the word list of the '''[http://extensions.openoffice.org/en/project/latin-spelling-and-hyphenation-dictionaries OpenOffice lexicon]''', which can also be used in a Python script via PyEnchant). The alternative is a morphology program that can lemmatize and categorize every word in the text.
For the latter, there are currently two open-source solutions:
#'''[http://mk270.github.io/whitakers-words/ Whitaker’s Words]''', an Ada-based analysis program for Latin texts.
#'''[https://github.com/PerseusDL/morpheus Morpheus]''', the parser used by the Perseus project.
Both programs are quite complex and often take some effort to compile correctly, even more so if you want to integrate them into your own scripts. As an easier alternative, at least for some tests, the corresponding web services ([http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&engine=morpheuslat&word=et example]) can be used instead.
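For quick tests, the web service can be queried from a Python script. The following is only a minimal sketch: the base URL and the parameters <code>lang</code>, <code>engine</code> and <code>word</code> are copied from the example link above, while the function names, the timeout and the UTF-8 decoding are assumptions. The response is returned unparsed, because its format is not described here and should be inspected before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example link in the text.
SERVICE_URL = "http://services.perseids.org/bsp/morphologyservice/analysis/word"


def build_query_url(word, lang="lat", engine="morpheuslat"):
    """Return the request URL for one word, mirroring the example link."""
    params = {"lang": lang, "engine": engine, "word": word}
    return SERVICE_URL + "?" + urlencode(params)


def fetch_analysis(word, timeout=10):
    """Fetch the raw response body for one word (needs network access).

    Assumption: the service answers with UTF-8 encoded text.
    """
    with urlopen(build_query_url(word), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling <code>build_query_url("et")</code> reproduces the example link. For larger corpora a local installation is probably still preferable, since one HTTP request per word is slow and depends on the service being available.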
If the analysis program is configured correctly, it should recognize large portions of the texts as orthographically correct Latin. Unrecognized words can be routinely replaced by their classical counterparts via a progressively adjusted rule set. Useful replacement rules are e.g. ci/ti, diff/def, ch/c etc., but also typical OCR mistakes like ic/it, ee/ec, b/h etc.
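Such a rule set could be sketched in Python roughly as follows. This is an illustration under several assumptions: the lexicon is a plain set of lowercase word forms, each rule is read as "replace the first string by the second" (the direction is not stated above), and only one replacement per word is tried. The names <code>RULES</code>, <code>candidates</code> and <code>normalize</code> are hypothetical.

```python
# (found, replacement) pairs from the text; the direction is an assumption.
RULES = [
    ("ci", "ti"),     # spelling variant, e.g. gracia -> gratia
    ("diff", "def"),  # e.g. diffinitio -> definitio
    ("ch", "c"),
    ("ic", "it"),     # typical OCR confusions from here on
    ("ee", "ec"),
    ("b", "h"),       # e.g. babet -> habet
]


def candidates(word, old, new):
    """Yield each variant of word with exactly one occurrence of old replaced."""
    start = word.find(old)
    while start != -1:
        yield word[:start] + new + word[start + len(old):]
        start = word.find(old, start + 1)


def normalize(word, lexicon, rules=RULES):
    """Return word if known, else the first rule variant found in lexicon.

    Returns None if no single replacement yields a known word, so the
    word can be handed over to manual correction.
    """
    if word in lexicon:
        return word
    for old, new in rules:
        for variant in candidates(word, old, new):
            if variant in lexicon:
                return variant
    return None
```

A variant is accepted only if the lexicon confirms it, which keeps a rule like b/h from corrupting words that are already correct. New rules can be appended to the list as recurring patterns show up among the unrecognized words.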
For a usable stylometric analysis, at least 95% of the words in the processed text should be recognized as correct Latin by the reference lexica; 100% recognition should nevertheless be the goal. To help with correcting the Latin texts, it may be advisable to program simple comparison and input forms that let the user compare each questionable word directly with the word in the original scan and correct it on the spot. It is also advisable to expand the lexica with custom word lists that cover the specific vocabulary of Arabic-Latin translations and the corresponding disciplines.
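The recognition rate can be checked with a few lines of Python. The sketch below assumes that the lexicon is a set of lowercase word forms and that tokens consist of the ASCII letters a-z only (ligatures and abbreviation marks would need extra handling). The function names are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase tokens made of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


def recognition_report(text, lexicon):
    """Return (share of recognized tokens, sorted unrecognized word forms)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0, []
    unknown = [token for token in tokens if token not in lexicon]
    rate = 1 - len(unknown) / len(tokens)
    return rate, sorted(set(unknown))


# Toy data: a base lexicon extended by a custom word list.
base_lexicon = {"et", "in", "terra"}
custom_words = {"pax"}

rate, unknown = recognition_report("Et in terra pax", base_lexicon)
print(rate >= 0.95, unknown)

rate, unknown = recognition_report("Et in terra pax", base_lexicon | custom_words)
print(rate >= 0.95, unknown)
```

The list of unrecognized forms is what a correction form would present to the user next to the scan, and forms confirmed as valid there can be added to the custom word list.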
=Analysis=