Current version as of 19 September 2017, 09:51


Tutorial for the identification of anonymous Arabic-Latin translators

[Figure: words typical of Dominicus Gundisalvi as a word cloud]

Composition of a text corpus

The aim of the project is to identify the translators of anonymous medieval Arabic-Latin translations by means of philological and computer-aided methods of style analysis. For this purpose, a corpus of electronic Latin texts must be constructed. It is advisable to restrict the corpus to a single Arabic author, e.g. Averroes, or to a discipline, e.g. philosophy, astronomy/astrology, medicine, mathematics, alchemy/magic/prophecy or religion. However, this is only possible if the resulting corpus is large enough. At Würzburg University, an Averroes-based corpus (Hasse 2010) and two corpora of philosophical and astronomical/astrological translations of the 12th century were compiled and used (Hasse 2016 and Hasse-Büttner, in print). For these, we were able to draw on a list of philosophical Arabic-Latin translations published by Burnett in 2005 and on Carmody's 1956 list of astronomical/astrological translations (which is, however, imprecise and outdated). For other branches of science, such lists have yet to be created.

Translations are available in very different forms: some are critically edited, others are available only in early printed editions or only in medieval manuscripts. OCR of modern editions is largely unproblematic. Reliable OCR of early printed editions, where the computer has to "learn" the typefaces of the individual printing shop, is currently a subject of research at the University of Würzburg and DFKI Kaiserslautern. At present, it is still advisable to transcribe early printed editions manually. For manuscripts, manual transcription will remain the only viable option for a long time. A preferred textual witness should be chosen, ideally one that provides a complete and unrevised text (Latin authors of early printed editions are listed in Hasse, Success and Suppression, 2016, pp. 317-407).

It is highly recommended to systematically separate and index the scans and the files produced during further processing. This can be done simply with separate subfolders and a separately maintained spreadsheet, or by means of a wiki. This step may seem self-evident, but it is easily overlooked. The following stages should always be distinguished:

  1. the bibliographic record of the source
  2. the scan
  3. the fully searchable and citable text of the scan
  4. a text cleaned of all non-textual features (page numbers, critical apparatus etc.)
  5. an orthographically normalized text prepared for stylometry (e.g. as a plain text file)

Processing the texts for comparative analysis

The citable text (3) is not yet usable for stylometry, but can be useful for other scholarly tasks. To compare texts stylometrically, they must first be made comparable. With editions of medieval texts, punctuation and orthography are the major obstacles: punctuation often follows the national conventions of the editors (German, French, English etc.), so that the "signal" of the author is lost. Orthography, in turn, ranges from "classicized" editions (e.g. Avicenna Latinus) to the faithful reproduction of the exact orthography of a single medieval manuscript. These problems can be mitigated by radically removing all punctuation marks, converting all uppercase letters to lowercase and finally classicizing the orthography. The last step is quite painful for medievalists, but there is no better alternative. As a first step, it helps, for instance, to replace all v with u and all j with i.
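The first normalization steps described above (removing punctuation, lowercasing, replacing v with u and j with i) could be sketched in Python roughly as follows. This is only a minimal illustration; the function name `normalize` and the decision to turn punctuation into spaces before collapsing whitespace are assumptions, not part of the project's own scripts.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase a text, strip punctuation and unify u/v and i/j.

    Hypothetical helper: the exact treatment of punctuation
    (replaced by a space) is an assumption of this sketch.
    """
    text = unicodedata.normalize("NFC", text).lower()
    # Replace every character that is neither a word character nor
    # whitespace (and also the underscore) with a space.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # First classicizing step named in the tutorial: v -> u, j -> i.
    text = text.replace("v", "u").replace("j", "i")
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("Dixit Iohannes: Veritas, vel JUSTITIA!"))
```

Further classicizing rules would be applied after this step, once the remaining unrecognized words have been identified.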

This process can be supported digitally by checking whether digital Latin reference lexica recognize the words in the texts of the corpus. The easiest approach is a comparison with a Latin word list (e.g. the one at https://github.com/cisocrgroup/Resources/tree/master/lexica/latin, or the word list of the OpenOffice Latin dictionary at http://extensions.openoffice.org/en/project/latin-spelling-and-hyphenation-dictionaries, which can also be used from a Python script via PyEnchant). The alternative is a morphology program, which can lemmatize and categorize every word in the text and look it up in a dictionary.

For the latter, there are currently two open-source solutions:

  1. Whitaker's Words, an Ada-based analysis program for Latin texts.
  2. Morpheus, the parser used by the Perseus project.


Both programs are quite complex and may require some effort to compile, especially if you want to integrate them into your own scripts using a wrapper. As an easier alternative, at least for some tests, the corresponding web services can be used as well. If the analysis program is configured correctly, it should recognize large portions of the texts as orthographically correct Latin. Unrecognized words can then be routinely replaced by their classical counterparts using a rule set that is refined step by step. Useful replacement rules are, for instance, ci/ti, diff/def, ch/c etc., but also rules for typical OCR errors such as ic/it, ee/ec, b/h etc.
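A rule-based replacement of this kind might look like the following sketch. It assumes that each rule is read from left to right (e.g. ci becomes ti), that a rule is applied at a single position at a time, and that a correction is accepted only if exactly one variant is found in the lexicon; the names `RULES`, `candidates` and `correct` are hypothetical, and the lexicon is simply a Python set of known word forms.

```python
# Replacement rules taken from the tutorial text; reading them as
# "left side -> right side" is an assumption of this sketch.
RULES = [("ci", "ti"), ("diff", "def"), ("ch", "c"),
         ("ic", "it"), ("ee", "ec"), ("b", "h")]


def candidates(word, rules=RULES):
    """Yield variants of `word` with one rule applied at one position."""
    for old, new in rules:
        start = word.find(old)
        while start != -1:
            yield word[:start] + new + word[start + len(old):]
            start = word.find(old, start + 1)


def correct(word, lexicon, rules=RULES):
    """Return a lexicon form for `word`, or `word` itself if unresolved."""
    if word in lexicon:
        return word
    found = sorted({c for c in candidates(word, rules) if c in lexicon})
    # Accept the correction only when it is unambiguous.
    return found[0] if len(found) == 1 else word


if __name__ == "__main__":
    lexicon = {"gratia", "definitio", "habet", "caritas"}
    for w in ["gracia", "diffinitio", "babet", "charitas", "michi"]:
        print(w, "->", correct(w, lexicon))
```

Words that remain unresolved (here `michi`) are the ones that suggest a new rule or need manual checking against the scan.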

For a usable stylometric analysis, at least 95% of the words in the processed text should be recognized as correct Latin by the reference lexica; the goal, however, should be 100%. To help with correcting the Latin texts, it may be advisable to program simple comparison and input forms that let the user compare a questionable word directly with the word in the original scan and correct it on the spot. It is also advisable to extend the lexica with custom word lists covering the specific vocabulary of Arabic-Latin translations and of the disciplines concerned.
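The 95% criterion can be checked with a few lines of Python. The sketch below assumes that the text is already tokenized and that the reference lexicon and the custom word list are plain sets of word forms; the function names are hypothetical.

```python
from collections import Counter


def recognition_rate(tokens, lexicon):
    """Share of tokens that are found in the lexicon (0.0 to 1.0)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)


def unrecognized(tokens, lexicon):
    """Unrecognized word forms with their counts, most frequent first."""
    return Counter(t for t in tokens if t not in lexicon).most_common()


def usable(tokens, lexicon, threshold=0.95):
    """True if the text reaches the recognition threshold."""
    return recognition_rate(tokens, lexicon) >= threshold


if __name__ == "__main__":
    tokens = "dixit auicenna quod anima est substantia".split()
    base = {"dixit", "quod", "anima", "est", "substantia"}
    custom = {"auicenna"}  # project-specific vocabulary (illustrative)
    print(recognition_rate(tokens, base), unrecognized(tokens, base))
    print(usable(tokens, base | custom))
```

Sorting the unrecognized forms by frequency shows where a new replacement rule or an addition to the custom word list would have the greatest effect.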

Analysis

Once the texts are available as cleaned plain text (txt) files, the actual stylometric work can begin. The files can be sorted into two groups: translations with known translators and translations with unknown translators. The current state of research must be taken into account here. When in doubt, a translation should rather be marked as "anonymous". In our studies, we accepted as reliable only the unambiguous translator attributions found in the incipits and colophons of the manuscripts and marked all other texts as anonymous translations.

The corpus can be analyzed stylometrically in (at least) two different ways: first, by looking at words used exclusively by one known translator, and second, by a computer-aided analysis of the most frequent words (MFW; the 100, 200 or so most frequent) of a text. The first method was developed at Würzburg; the second is based on Burrows Delta (Burrows 2002).

(I) Exclusive Words

Experience has shown that anonymous translators can be identified by frequently used words that are not specific to a discipline and are used exclusively by a single known translator. For example, Dominicus Gundisalvi is the only translator who uses the phrases sic ut, vel est, cuius comparatio, opus fuit, id per quod, id autem quod and omnis quod est, which are also found in the anonymous translation of Alexander of Aphrodisias' De intellectu: a strong indication that Gundisalvi translated this treatise. This result is reached in two steps:

  1. The first step is to search for frequent terms that occur exclusively in the work of a single translator. Programming a simple search tool for this is highly recommended. When filtering the word lists, flexible parameters help: a minimum frequency for the terms, or the share of a translator's texts in which a term must occur. To analyze word groups as well, the texts can be split into lists of n-grams (i.e. overlapping sequences of several words). In this way, the set of terms exclusive to a translator can be reduced to typical and frequently used ones, for instance terms that appear at least 10 times in the works of a translator and in at least 40% of his translations. For example, the phrase iterum quia appears in 4 of the 10 translations by Gerhard of Cremona in our philosophical corpus, 56 times in total. It is thus both exclusive to and frequently used by Gerhard of Cremona. To follow up a suspicion of false attributions or of collaboration between translators, a further parameter can be introduced that tolerates a certain number of exceptions, i.e. terms that are occasionally used by other translators after all.
  2. In a second step, content words must be removed from this list by hand, i.e. terms such as substantia composita or horoscopus that are typical of a particular discipline or subdiscipline such as metaphysics or astrology. What remains are stylistic words in a broader sense, i.e. words that could in principle occur in any scientific Latin text of the period: not only conjunctions and other particles, but also words and phrases such as examinatio, annullare or demonstrare voluimus, which are not discipline-specific. This focus is important because experience has shown that content words are more readily adopted by other translators, whereas stylistic words and phrases remain more stably confined to one author.
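The search described in the first step could be sketched as follows. The corpus is assumed to be a dictionary that maps each known translator to a list of tokenized texts; `exclusive_terms` and its parameter names are hypothetical, with defaults taken from the thresholds mentioned above (10 occurrences, 40% of the translations, no occurrences elsewhere). The manual removal of content words in the second step is not automated here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Overlapping sequences of n words, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exclusive_terms(corpus, translator, n=1, min_count=10,
                    min_share=0.4, max_foreign=0):
    """Frequent n-grams used (almost) only by `translator`.

    Returns {term: (total count, number of the translator's texts
    containing it)}. `max_foreign` is the error tolerance: how often
    a term may still occur in texts by other translators.
    """
    own_texts = corpus[translator]
    total, doc_freq = Counter(), Counter()
    for tokens in own_texts:
        grams = ngrams(tokens, n)
        total.update(grams)
        doc_freq.update(set(grams))
    foreign = Counter()
    for other, texts in corpus.items():
        if other != translator:
            for tokens in texts:
                foreign.update(ngrams(tokens, n))
    return {
        term: (count, doc_freq[term])
        for term, count in total.items()
        if count >= min_count
        and doc_freq[term] / len(own_texts) >= min_share
        and foreign[term] <= max_foreign
    }


if __name__ == "__main__":
    corpus = {  # toy data, for illustration only
        "gerhard": [["iterum", "quia", "est"], ["iterum", "quia", "non"],
                    ["et", "est"]],
        "gundisalvi": [["sic", "ut", "est"], ["et", "non"]],
    }
    print(exclusive_terms(corpus, "gerhard", n=2, min_count=2))
```

Running the search separately for n = 1, 2, 3 and so on yields candidate lists of single words and of phrases, which are then filtered by hand.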


Subsequently, one can note for each anonymously translated text in the corpus which of the translators' exclusive and frequently used words appear in it. If positive and negative evidence coincide, i.e. if a whole series of words exclusive to one translator appears in the anonymous text (positive) and at the same time none of the exclusive words of the other translators (negative), the attribution of the text to that translator is very secure.
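The combination of positive and negative evidence can be illustrated with a short sketch. It assumes that the manually filtered lists of exclusive stylistic terms are available as sets per translator; the threshold `min_hits` is a purely hypothetical stand-in for "a whole series of words" and would have to be chosen by the researcher.

```python
def evidence(tokens, exclusive):
    """For each translator, list the exclusive terms found in the text."""
    # Pad with spaces so that only whole words and phrases match.
    text = " " + " ".join(tokens) + " "
    return {
        translator: sorted(t for t in terms if f" {t} " in text)
        for translator, terms in exclusive.items()
    }


def candidate_translator(hits, min_hits=3):
    """Return a translator only if positive and negative evidence agree.

    Positive: at least `min_hits` exclusive terms of one translator.
    Negative: no exclusive terms of any other translator.
    """
    with_hits = [tr for tr, found in hits.items() if found]
    if len(with_hits) == 1 and len(hits[with_hits[0]]) >= min_hits:
        return with_hits[0]
    return None


if __name__ == "__main__":
    exclusive = {  # example terms taken from the tutorial text
        "gundisalvi": {"sic ut", "vel est", "opus fuit"},
        "gerhard": {"iterum quia"},
    }
    tokens = "sic ut anima vel est et opus fuit".split()
    hits = evidence(tokens, exclusive)
    print(hits, candidate_translator(hits))
```

A result of `None` does not mean that the text has no translator among the known ones, only that the evidence is mixed or too thin for this simple rule.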

For very short anonymously translated texts, it may be worthwhile to also examine rarer stylistic words systematically, e.g. those that appear fewer than 10 times and in fewer than 40% of a translator's translations. Such an analysis must, however, also systematically compare the rarer words of the other translators. Experience shows that only a concentration of such less typical words and phrases in an anonymous text really permits an attribution to a translator.

(II) Computerized Stylometry using Burrows Delta

The second method is based on John Burrows' idea that authorship can be determined with computer support by comparing the standardized relative frequencies of the most frequent words (MFW) of individual texts. The method has proven highly successful in computational authorship attribution. Several implementations are freely available on the web. A user-friendly interface is offered by the Stylo package for R by Maciej Eder and Jan Rybicki. We used our own implementation in Python, based on Fotis Jannidis' pydelta (https://github.com/fotis007/pydelta). Such implementations usually offer a choice between different distance measures ("Deltas"), i.e. different methods by which the computer calculates the distance between texts (more precisely: between the lists of frequencies of their most frequent words). Recent comparative studies have shown that "Cosine Delta" is a particularly well-performing stylometric distance measure. We, too, obtained our best results with Cosine Delta.
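To make the calculation concrete, the following pure-Python sketch computes Cosine Delta, understood here as the cosine distance between vectors of z-scored relative word frequencies. It is not the pydelta or Stylo code; the function names, the selection of the MFW by summed absolute counts and the use of the population standard deviation are assumptions of this illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def most_frequent_words(texts, n=150):
    """The n most frequent words across all texts (by absolute count)."""
    counts = Counter()
    for tokens in texts.values():
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]


def zscore_vectors(texts, words):
    """Standardized relative frequencies of `words` for every text."""
    rel = {}
    for name, tokens in texts.items():
        counts = Counter(tokens)
        rel[name] = [counts[w] / len(tokens) for w in words]
    columns = list(zip(*rel.values()))
    means = [mean(col) for col in columns]
    # A word with identical frequency everywhere has deviation 0;
    # use 1.0 instead so that its z-score becomes 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return {
        name: [(x - m) / s for x, m, s in zip(vec, means, stds)]
        for name, vec in rel.items()
    }


def cosine_delta(u, v):
    """Cosine distance (1 - cosine similarity) of two z-score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


if __name__ == "__main__":
    texts = {  # toy data, for illustration only
        "a": "et et et non quod".split(),
        "b": "et et et non quod".split(),
        "c": "quod quod quod non et".split(),
    }
    vectors = zscore_vectors(texts, most_frequent_words(texts, 3))
    print(cosine_delta(vectors["a"], vectors["b"]))
    print(cosine_delta(vectors["a"], vectors["c"]))
```

The value ranges from 0 (identical profile) to 2 (opposite profile); the pairwise distances between all texts form the input for the clustering.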

In a first step, only those texts of the corpus whose translators are known are analyzed. The number of most frequent words (100, 200 or more) can be set in most implementations. We obtained very good results with the 150 most frequent words. Each text of the corpus is represented internally by a vector containing the standardized relative frequencies of these words. The distance between these vectors is then calculated with Cosine Delta. The computer then forms groups, or clusters, on the basis of these distances, which are visualized in a dendrogram (a tree diagram). With this procedure, the computer was indeed able to sort the translations of known translators in the corpus of 12th-century philosophical translations into one group each: the group of translations by Dominicus Gundisalvi, that of Gerhard of Cremona, etc. Once this has succeeded, the method is calibrated, so to speak.

In a second step, the anonymous translations are added. The resulting dendrogram must be interpreted carefully: if the Gundisalvi group (or Gerhard group etc.) of the calibrated standard remains stable and is merely extended by one or another anonymous translation, it is very likely that these anonymous translations were in fact produced by Gundisalvi. If, however, the Gundisalvi group (or Gerhard group etc.) breaks up into several subgroups that are no longer connected in the dendrogram, the computer evidently cannot assign the anonymous translation.
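The two steps can be imitated without a plotting library by clustering the texts and then checking whether the groups of known translators stay intact. The sketch below uses a simple average-linkage clustering as a stand-in; the tutorial does not say which linkage the project used, and the function names, the cut into k clusters and the toy distances are all assumptions. Distances are expected as a dictionary keyed by `frozenset` pairs of text names.

```python
def average_linkage(names, dist, k):
    """Merge the closest clusters until only k clusters remain."""
    clusters = [[name] for name in names]

    def between(c1, c2):
        # Average pairwise distance between two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def attribute(clusters, translator_of):
    """Map anonymous texts to the translator of their cluster.

    Returns None if a cluster mixes translators or if one translator's
    texts are split over several clusters (the groups are not stable).
    Anonymous texts are those with no entry, or None, in translator_of.
    """
    seen, result = set(), {}
    for cluster in clusters:
        known = {translator_of[n] for n in cluster if translator_of.get(n)}
        if len(known) > 1 or known & seen:
            return None
        seen |= known
        for name in cluster:
            if not translator_of.get(name) and known:
                result[name] = next(iter(known))
    return result


if __name__ == "__main__":
    names = ["g1", "g2", "r1", "r2", "anon"]
    close = {("g1", "g2"): 0.2, ("r1", "r2"): 0.3,
             ("anon", "g1"): 0.25, ("anon", "g2"): 0.3}
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist[frozenset((a, b))] = close.get((a, b), close.get((b, a), 1.0))
    translator_of = {"g1": "gundisalvi", "g2": "gundisalvi",
                     "r1": "gerhard", "r2": "gerhard"}
    clusters = average_linkage(names, dist, k=2)
    print(clusters, attribute(clusters, translator_of))
```

Calibration corresponds to running the same check on the known texts alone, with k set to the number of known translators; a real analysis would still inspect the full dendrogram rather than rely on a fixed cut.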

Fortunately, in our experiments the results of method 1 (exclusive words) largely agreed with those of method 2 (MFW), at least for the philosophical corpus. The astronomical/astrological corpus, however, is not yet large enough for method 2.
