Tutorial anonymus translators (en)
From Kallimachos
===Composition of a text corpus===
The aim of research for '''[[Identifikation_von_Übersetzern:Main |the project]]''' is the identification of anonymous Arabic-Latin translations of the medieval period by means of philological and computer-aided methods of style analysis.
For this purpose, a corpus of electronic Latin texts must be constructed. It is advisable to restrict the corpus to a certain Arabic author, e.g. Averroes, or to a technical discipline, e.g. philosophy, astronomy/astrology, medicine, mathematics, alchemy/magic/prophecy or religion.
However, this is only possible if the corpus is large enough. At Würzburg University, an Averroes-based corpus (Hasse 2010) and two corpora with philosophical and astronomical/astrological translations of the 12th century were compiled and used (Hasse 2016 and Hasse-Büttner in print). Here we were able to benefit from a list of philosophical Arabic-Latin translations provided by Burnett in 2005, as well as from a list of astronomical/astrological translations published by Carmody in 1956 (which is imprecise and outdated, though). In other branches of science, such lists have yet to be created.
The citable text (3) is not yet usable for stylometry, but it can be useful for other scholarly tasks. To compare texts by means of stylometry, they first need to be made comparable. In the field of medieval editions, punctuation and orthography are major obstacles: punctuation rules often vary according to the national customs of the editors (German, French, English etc.), so that the "signal" of the author is lost. Orthography, in turn, ranges from "classicized" editions (e.g. Avicenna Latinus) to the faithful reproduction of the exact orthography of a single medieval manuscript. These problems can be mitigated by radically removing all punctuation marks, changing all uppercase letters to lowercase letters and finally classicizing the orthography. The last step is quite painful for medievalists, but there is no better alternative. As a first step, it is helpful, for instance, to replace all v with u and all j with i.
This process can be supported digitally by checking whether digital Latin reference lexica recognize the words in the texts of the corpus. The easiest approach is a comparison with a Latin word list (for instance '''[https://github.com/cisocrgroup/Resources/tree/master/lexica/latin here]''' or the word list of the '''[http://extensions.openoffice.org/en/project/latin-spelling-and-hyphenation-dictionaries OpenOffice lexicon]''', which can also be used in a Python script via PyEnchant). Alternatively, a morphology program can be used that lemmatizes and categorizes every word in the text and looks it up in a dictionary.
For the latter, there are currently two open-source solutions:
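The normalization steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not the project's actual code: the function name <code>normalize_latin</code> is hypothetical, digits are kept as they are (an assumption), and the classicizing of the orthography beyond v/u and j/i is not covered.

```python
import unicodedata


def normalize_latin(text: str) -> str:
    """Lowercase a text, remove punctuation, and merge v/u and j/i spellings."""
    # Compose accented characters consistently before any further processing.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace everything that is neither a letter, a digit nor whitespace
    # (i.e. punctuation marks and symbols) with a space.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    # First orthographic step named in the tutorial: v -> u and j -> i.
    text = text.replace("v", "u").replace("j", "i")
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

The result is a plain sequence of lowercase tokens separated by single spaces, which can be split with <code>str.split()</code> for all later steps.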
If the analysis program is configured correctly, it should be able to recognize large portions of the texts as orthographically correct Latin. Unrecognized words can be routinely replaced by their classical counterparts via a progressively adjusted rule set. Useful replacement rules are, for instance, ci/ti, diff/def, ch/c etc., but also typical OCR mistakes like ic/it, ee/ec, b/h etc.
For a usable stylometric analysis, at least 95% of the words in the processed text should be recognized as correct Latin by the reference lexica. However, 100% recognition should be the goal. To help with the correction of the Latin texts, it may be advisable to program simple comparison and input forms that allow the user to compare the words in question directly with the word in the original scan and to correct them on the spot. Furthermore, it is advisable to expand the dictionaries in use with custom word lists that cover the specific vocabulary of Arabic-Latin translations and the corresponding disciplines.
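A rule-based repair of unrecognized words could look roughly like the following sketch. The rule pairs are the ones named above; everything else is an assumption made for illustration: the name <code>repair_word</code>, the fact that each rule is tried in both directions, and the restriction to a single substitution per word. The lexicon is simply a Python set of accepted word forms.

```python
# Rule pairs taken from the tutorial text; the direction in which each pair
# should be applied is not specified there, so both directions are tried.
RULES = [("ci", "ti"), ("diff", "def"), ("ch", "c"),
         ("ic", "it"), ("ee", "ec"), ("b", "h")]


def repair_word(word, lexicon, rules=RULES):
    """Return a lexicon form reachable by one substitution, or None."""
    if word in lexicon:
        return word
    for left, right in rules:
        for old, new in ((left, right), (right, left)):
            # Try the substitution at every position where `old` occurs.
            start = word.find(old)
            while start != -1:
                candidate = word[:start] + new + word[start + len(old):]
                if candidate in lexicon:
                    return candidate
                start = word.find(old, start + 1)
    return None
```

Words for which no rule produces a known form are returned as <code>None</code> and have to be checked by hand or covered by a new rule, which matches the idea of a progressively adjusted rule set.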
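Whether a text reaches the 95% threshold can be checked with a short script such as the one below. It is a sketch under simple assumptions: the text has already been normalized and split into tokens, the lexicon is a plain set of word forms (for example loaded from one of the word lists mentioned above plus custom additions), and the function names are hypothetical.

```python
from collections import Counter


def recognition_rate(tokens, lexicon):
    """Share of tokens (between 0.0 and 1.0) that are found in the lexicon."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


def unrecognized(tokens, lexicon):
    """Unrecognized tokens with their counts, most frequent first."""
    return Counter(t for t in tokens if t not in lexicon).most_common()


def usable_for_stylometry(tokens, lexicon, threshold=0.95):
    """True if the text meets the recognition threshold named in the tutorial."""
    return recognition_rate(tokens, lexicon) >= threshold
```

Sorting the unrecognized words by frequency shows where a new replacement rule or a custom word-list entry would have the largest effect.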
=Analysis=
==Analysis==
Once the texts are available in an adjusted txt format, the actual stylometric analysis can begin. The dataset can be divided into two groups, one with known and one with unknown translators. It is important to keep up with the current state of research in this regard. When in doubt, a translation should rather be marked as "anonymous". For our research, we accepted only unambiguous attributions found in the incipits and colophons of the manuscripts as reliable and marked all other texts as anonymous translations.
The corpus can be analyzed with (at least) two different methods: by looking at words that are used exclusively by one of the known translators, and by a computerized analysis of the most frequent words (MFW) of a text. The first method has been developed at Würzburg University; the second is based on ''Burrows Delta'' (Burrows 2002).
===(I) Exclusive Words===
Experience has shown that anonymous translators can be identified by looking at frequently used words that are used exclusively by a single known translator and that do not depend on the discipline of the text. As an example, Dominicus Gundisalvi is the only translator who uses ''sic ut, vel est, cuius comparatio, opus fuit, id per quod, id autem quod'' and ''omnis quod est'', all of which can also be found in the anonymous translation of Alexander of Aphrodisias's ''De intellectu''. This is a strong indication that Gundisalvi is the actual translator of the treatise. Getting there is a two-step process:
#The first step is to search for frequent terms that are used exclusively by a single translator. To this end, it is advisable to program a simple search engine. When filtering the word lists, flexible parameters help to set a minimum frequency or the number of texts that must contain the term in question. To analyze word groups, the texts can be split into lists of n-grams (i.e. overlapping sequences of several words). In this way, the list of exclusive words can be reduced to typical and frequently used terms, for instance terms that appear at least 10 times in the works of the translator and in at least 40% of his translations. As an example, the term ''iterum quia'' appears in 4 of the 10 translations by Gerhard of Cremona in our philosophical corpus, where it is used a total of 56 times. Thus, ''iterum quia'' is both an exclusive and a frequently used term in Gerhard's work. If a false attribution is suspected, an additional error-tolerance parameter can be employed that also admits terms used very rarely by other translators.
#In a second step, content words specific to the discipline of the text, like ''substantia composita'' or ''horoscopus'', have to be filtered out of this list. The remaining words are stylistic words in a narrower sense, i.e. words that can in principle be used in all scientific Latin texts of this period. These include not only conjunctions and other particles, but also words and phrases like ''examinatio'', ''annullare'' or ''demonstrare voluimus''. This focus is important, as experience has shown that content words are adopted more easily by other translators, whereas stylistic words and phrases tend to remain specific to a single author.
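The search described in the first step can be sketched as follows. This is an illustrative sketch, not the search engine used in the project: the function names, the parameter names and the layout of the corpus (a dictionary mapping each translator to a list of tokenized texts) are assumptions, and the thresholds are read as a minimum total count, a minimum share of the translator's texts and a maximum number of occurrences tolerated in other translators' texts. The second step, removing content words, still has to be done by a human reader.

```python
from collections import Counter


def ngrams(tokens, n):
    """Overlapping sequences of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exclusive_terms(corpus, translator, n=2,
                    min_total=10, min_share=0.4, tolerance=0):
    """Frequent n-grams that (almost) only the given translator uses.

    corpus maps a translator name to a list of token lists, one per text.
    """
    own_total = Counter()   # total occurrences in the translator's texts
    own_texts = Counter()   # number of the translator's texts containing the term
    for tokens in corpus[translator]:
        counts = Counter(ngrams(tokens, n))
        own_total.update(counts)
        own_texts.update(counts.keys())

    foreign = Counter()     # occurrences in all other translators' texts
    for other, texts in corpus.items():
        if other != translator:
            for tokens in texts:
                foreign.update(ngrams(tokens, n))

    n_texts = len(corpus[translator])
    return sorted(
        term for term, total in own_total.items()
        if total >= min_total
        and own_texts[term] / n_texts >= min_share
        and foreign[term] <= tolerance
    )
```

With the defaults, a term such as ''iterum quia'' in the example above (56 occurrences, 4 of 10 texts, no occurrences elsewhere) would pass all three filters.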
Subsequently, you can note for each anonymously translated text in the corpus which of these words appear in it. If positive and negative evidence agree, that is, if several words exclusive to a single translator appear in the text (positive) and at the same time no exclusive words of other translators do (negative), the attribution of the text to the known translator is quite certain.
For very short texts, it may be advisable to expand the analysis to less frequent words. In this case, however, the less frequent words of the other translators have to be compared as well. Experience shows that only a large accumulation of these less typical words and phrases in an anonymously translated text allows for a credible attribution.
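The comparison of positive and negative evidence can be sketched in the same way. The code below is an assumption-laden illustration: the function names are hypothetical, the lists of exclusive terms are given as sets of n-grams per translator, and the cut-off <code>min_hits</code> is an arbitrary stand-in for "several words", which the tutorial does not quantify.

```python
def ngrams(tokens, n):
    """Overlapping sequences of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def evidence(tokens, exclusive, n=2):
    """For each translator, the exclusive terms that occur in the text.

    exclusive maps a translator name to a set of exclusive n-grams.
    """
    present = set(ngrams(tokens, n))
    return {name: sorted(terms & present) for name, terms in exclusive.items()}


def attribute(tokens, exclusive, n=2, min_hits=2):
    """Name of the translator if the evidence is unambiguous, else None.

    Positive evidence: at least min_hits exclusive terms of one translator.
    Negative evidence: no exclusive terms of any other translator.
    """
    hits = evidence(tokens, exclusive, n)
    candidates = [name for name, found in hits.items() if found]
    if len(candidates) == 1 and len(hits[candidates[0]]) >= min_hits:
        return candidates[0]
    return None
```

A result of <code>None</code> does not rule out an attribution; it only means that the text needs closer philological inspection.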
===(II) Computerized Stylometry using ''Burrows Delta''===
The second method is based on the ideas of John Burrows, which assume that authorship can be identified by comparing the standardized relative frequencies of the most frequent words (MFW) in texts. The method has proven highly successful for computerized authorship attribution. Many open-source implementations of this method can be found on the web. A user-friendly interface is provided by the Stylo package for R by Maciej Eder and Jan Rybicki. We used our own implementation in Python, based on Fotis Jannidis's [https://github.com/fotis007/pydelta pydelta]. Usually, these implementations offer a choice between different distance measures or "Deltas", i.e. different methods for calculating the stylistic "distance" between two texts. Recent studies have shown that the so-called "Cosine Delta" is a particularly well-performing stylometric distance measure. We got our best results with Cosine Delta as well.
In the first step, we analyzed the texts in the corpus with a known translator. The number of most frequent words (100, 200 or more) can be adjusted in most implementations of the method. We obtained our results with the 150 most frequent words in the texts. Each text of the corpus is processed as a vector containing the standardized relative frequencies of these words. The distance between these vectors is calculated using Cosine Delta. After that, the computer forms groups or clusters based on these distances, which can then be visualized in a dendrogram. Using this method, the computer was indeed able to sort the texts with known translators into groups according to these translators, i.e. one group for the translations by Dominicus Gundisalvi, one for Gerhard of Cremona etc. Once this clustering succeeds, the method is calibrated, so to speak.
In the second step, the anonymous translations are added to the system. The resulting dendrograms have to be interpreted carefully: if the Gundisalvi cluster (or the Gerhard cluster etc.) remains stable and is merely expanded by additional anonymous translations, these texts were likely produced by Gundisalvi (or Gerhard etc.). However, if the groups disperse, the computer is evidently unable to attribute the anonymous translations correctly.
Fortunately, the results of method 1 (exclusive words) and method 2 (MFW) mostly matched in our attempts, at least for the philosophical corpus. The astronomical/astrological corpus, however, is not yet big enough for method 2.
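The calculation described above can be illustrated with a small sketch in plain Python. It is not the implementation used in the project and does not reproduce pydelta or Stylo: the function names are hypothetical, the most frequent words are ranked by their total count in the whole corpus (an assumption), and the standardization uses the population standard deviation (also an assumption). The clustering and the dendrogram are left out; a nearest-neighbour lookup stands in for them.

```python
import math
import statistics
from collections import Counter


def mfw_zscores(texts, n_mfw=150):
    """Map each text label to its vector of z-scored relative MFW frequencies.

    texts maps a label to the token list of one text.
    """
    total = Counter()
    for tokens in texts.values():
        total.update(tokens)
    words = [word for word, _ in total.most_common(n_mfw)]

    # Relative frequency of every MFW in every text.
    rel = {}
    for label, tokens in texts.items():
        counts = Counter(tokens)
        rel[label] = [counts[word] / len(tokens) for word in words]

    # Standardize each word across all texts (z-score); a word with zero
    # variance carries no information and is set to 0.
    labels = list(texts)
    vectors = {label: [] for label in labels}
    for i in range(len(words)):
        column = [rel[label][i] for label in labels]
        mean = statistics.mean(column)
        std = statistics.pstdev(column)
        for label in labels:
            z = (rel[label][i] - mean) / std if std > 0 else 0.0
            vectors[label].append(z)
    return vectors


def cosine_delta(u, v):
    """Cosine distance (1 - cosine similarity) between two z-score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def nearest(label, vectors):
    """Label of the text with the smallest Cosine Delta to the given text."""
    others = (other for other in vectors if other != label)
    return min(others, key=lambda o: cosine_delta(vectors[label], vectors[o]))
```

For the clustering and the dendrogram, the pairwise distances could be passed on to a hierarchical clustering routine, for example <code>linkage</code> and <code>dendrogram</code> from <code>scipy.cluster.hierarchy</code>; which linkage method suits the corpus best is a separate choice that this sketch does not make.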
<headertabs />
{{Sprachauswahl|Tutorial for identification of anonymous arabic-latin translators (en)|Tutorial_Anonyme_Übersetzer}}