Tutorial anonymus translators (en)

From Kallimachos
Revision as of 19 September 2017, 09:51, by Jonathan (Composition of a text corpus)

Tutorial for the identification of anonymous Arabic-Latin translators

Figure: typical words of Dominicus Gundisalvi as a word cloud

Composition of a text corpus

The aim of the project is to identify anonymous Arabic-Latin translations of the Middle Ages by means of philological and computer-aided methods of style analysis. For this purpose, a corpus of electronic Latin texts must be constructed. It is advisable to restrict the corpus to a single Arabic author, e.g. Averroes, or to a single discipline, e.g. philosophy, astronomy/astrology, medicine, mathematics, alchemy/magic/prophecy or religion. However, this is only possible if the resulting corpus is large enough. At Würzburg University, an Averroes-based corpus (Hasse 2010) and two corpora of philosophical and astronomical/astrological translations of the 12th century were compiled and used (Hasse 2016 and Hasse-Büttner, in print). For these, we were able to draw on a list of philosophical Arabic-Latin translations provided by Burnett in 2005 and on Carmody's 1956 list of astronomical/astrological translations (which is imprecise and outdated, though). For other disciplines, such lists have yet to be created.

Translations are available in very different forms: some are critically edited, others are available only in early printed editions or only in medieval manuscripts. OCR of modern editions is largely unproblematic. Reliable OCR of early printed editions, where the software has to "learn" the typeface of each printing house, is currently a subject of research at the University of Würzburg and DFKI Kaiserslautern. At present, it is still advisable to transcribe early printed editions manually. For manuscripts, manual transcription will remain the only viable option for a long time. A suitable textual witness should be chosen, ideally one that provides a complete and unrevised text (Latin authors in early printed editions are listed in Hasse, Success and Suppression, 2016, pp. 317-407).

It is highly recommended to systematically separate and index the scans and the files produced during further processing. This can be done simply by using separate subfolders and a separately maintained spreadsheet, or by means of a wiki. This step may seem self-evident, but it is easily overlooked. The following items should always be kept distinct:

  1. the bibliographic record of the source
  2. the scan
  3. the fully searchable and citable scan
  4. a text cleaned of all non-textual features (page numbers, critical apparatus etc.)
  5. an orthographically normalized text made for stylometry (e.g. as a simple text file)

Processing the texts for comparative analysis

The citable text (3) is not yet usable for stylometry, but can be useful for other scholarly tasks. Before texts can be compared stylometrically, they have to be made comparable. In editions of medieval texts, punctuation and orthography are the major obstacles: punctuation often follows the national conventions of the editors (German, French, English etc.), so that the "signal" of the author is lost. Orthography, in turn, ranges from "classicized" editions (e.g. Avicenna Latinus) to the faithful reproduction of the exact orthography of a single medieval manuscript. These problems can be mitigated by radically removing all punctuation marks, converting all uppercase letters to lowercase and finally classicizing the orthography. The last step is quite painful for medievalists, but there is no better alternative. As a first step, it helps, for instance, to replace every v with u and every j with i.
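
The first normalization steps named above (lowercasing, removing punctuation, v to u, j to i) could be sketched in Python roughly as follows. The function name normalize_for_stylometry is hypothetical, and dropping digits together with the punctuation is an assumption of this sketch, not something the procedure above prescribes; the classicizing of the orthography is not covered here.

```python
import re
import unicodedata


def normalize_for_stylometry(text: str) -> str:
    """Lowercase the text, map v->u and j->i, and keep letters only."""
    # Unify composed/decomposed Unicode forms before any replacement.
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("v", "u").replace("j", "i")
    # Everything that is not a letter (punctuation, digits, underscores,
    # whitespace) becomes a single space. Dropping digits is an assumption.
    text = re.sub(r"[\W\d_]+", " ", text)
    return text.strip()


if __name__ == "__main__":
    print(normalize_for_stylometry("Et dixit: «Vnde venit Juppiter?»"))
```

Keeping this step in one small function makes it easy to apply exactly the same normalization to every text of the corpus, which is the point of making the texts comparable.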

This process can be supported by checking whether digital Latin reference lexica recognize the words in the texts of the corpus. The simplest approach is a comparison with a Latin word list (for instance here, or the word list of the OpenOffice dictionary, which can also be used in a Python script via PyEnchant). Alternatively, a morphology program can be used, which lemmatizes and categorizes every word in the text and looks it up in a dictionary.

For the latter, there are currently two open-source solutions:

  1. Whitaker’s Words, an Ada-based analysis program for Latin texts.
  2. Morpheus, the parser used by the Perseus project.

Both programs are quite complex and may require some effort to compile correctly, especially if you want to integrate them into your own scripts using a wrapper. As an easier alternative, at least for some tests, the corresponding web services (example) can be used as well. If the analysis program is configured correctly, it should recognize large portions of the texts as orthographically correct Latin. Unrecognized words can be routinely replaced by their classical counterparts via a progressively adjusted rule set. Useful replacement rules are, for instance, ci/ti, diff/def, ch/c etc., but also typical OCR mistakes like ic/it, ee/ec, b/h etc.
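
A rule set of this kind could be applied along the following lines. This is only a sketch under several assumptions: the lexicon is a plain Python set of recognized word forms (standing in for a word list or a morphology program), the direction of each rule (medieval or OCR spelling on the left, classical spelling on the right) is our reading of the pairs above, and the names RULES and classicize are hypothetical. A variant is accepted only if the lexicon recognizes it, so a rule never turns a word into another unrecognized word.

```python
# Replacement rules as (found spelling, classical spelling) pairs.
# The direction of each pair is an assumption of this sketch.
RULES = [
    ("ci", "ti"), ("diff", "def"), ("ch", "c"),   # orthographic variants
    ("ic", "it"), ("ee", "ec"), ("b", "h"),       # typical OCR mistakes
]


def classicize(word, lexicon, rules=RULES):
    """Return a recognized variant of `word`, or `word` itself if none exists."""
    if word in lexicon:
        return word
    for old, new in rules:
        # Try the rule at every position where `old` occurs, one at a time.
        start = word.find(old)
        while start != -1:
            candidate = word[:start] + new + word[start + len(old):]
            if candidate in lexicon:
                return candidate
            start = word.find(old, start + 1)
    return word  # still unrecognized: leave it for manual correction


if __name__ == "__main__":
    toy_lexicon = {"gratia", "definitio", "habet"}  # hypothetical mini lexicon
    for w in ["gracia", "diffinitio", "babet", "xyz"]:
        print(w, "->", classicize(w, toy_lexicon))
```

The sketch applies only one rule at one position per word; words that need several corrections at once stay unrecognized and end up in the list for manual checking.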

For a usable stylometric analysis, at least 95% of the words in the processed text should be recognized as correct Latin by the reference lexica. However, 100% recognition should be the goal. To help with the correction of the Latin texts, it may be advisable to program simple comparison and input masks that allow the user to compare the words in question directly with the word in the original scan and to correct them on the spot. Furthermore, it is advisable to expand the dictionaries used with custom word lists that cover the specific vocabulary of Arabic-Latin translations and the corresponding disciplines.
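
Whether a text reaches the 95% mark can be checked with a few lines of Python. The sketch below assumes that the reference lexicon and the custom word lists are available as plain sets of word forms; the function name recognition_rate and the toy data are hypothetical.

```python
def recognition_rate(tokens, lexicon):
    """Return (share of recognized tokens, sorted list of unknown word forms)."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("the text contains no tokens")
    unknown = [t for t in tokens if t not in lexicon]
    rate = (len(tokens) - len(unknown)) / len(tokens)
    return rate, sorted(set(unknown))


if __name__ == "__main__":
    base_lexicon = {"et", "dixit", "est"}   # stands in for a reference lexicon
    custom_words = {"algorismus"}           # discipline-specific additions
    tokens = "et dixit algorismus est".split()

    rate, unknown = recognition_rate(tokens, base_lexicon)
    print(f"{rate:.0%} recognized, unknown: {unknown}")

    rate, unknown = recognition_rate(tokens, base_lexicon | custom_words)
    print(f"{rate:.0%} recognized, unknown: {unknown}")
```

The list of unknown forms is the natural input for the manual correction step and for extending the custom word lists.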


Once the texts are available in a normalized plain-text format, the actual stylometric analysis can begin. The dataset is divided into two groups, one with known and one with unknown translators. It is important to follow the current state of research in this regard. When in doubt, a translation should be marked as "anonymous". For our research, we accepted only unambiguous attributions found in the incipits and colophons of the manuscripts as reliable and marked all other texts as anonymous translations.

The corpus can be analyzed with (at least) two different methods: by looking at words that are used exclusively by one of the known translators, and by computational analysis of the most frequent words (MFW) of a text. The first method was developed at Würzburg University; the second is based on Burrows's Delta (Burrows 2002).

(I) Exclusive Words

Experience has shown that anonymous translators can be identified by looking at frequently used words that are used exclusively by a single known translator and that do not depend on the discipline of the text. For example, Dominicus Gundisalvi is the only translator who uses sic ut, vel est, cuius comparatio, opus fuit, id per quod, id autem quod and omnis quod est, all of which can also be found in the anonymous translation of Alexander of Aphrodisias' De intellectu: a strong indication that Gundisalvi is the actual translator of the treatise. Getting there is a two-step process:

  1. The first step is to search for frequent terms that are used exclusively by a single translator. To this end, it is advisable to program a simple search engine. When filtering the word lists, flexible parameters help to set a minimum frequency or the minimum share of texts that have to contain the term in question. To analyze word groups, the texts can be split into lists of n-grams (i.e. overlapping sequences of several words). In this way, the list of exclusive words can be reduced to typical and frequently used terms, for instance terms that occur at least 10 times in the works of the translator and in at least 40% of his translations. As an example, the term iterum quia appears in 4 of the 10 translations by Gerard of Cremona in our philosophical corpus, where it is used a total of 56 times. Thus, iterum quia is both an exclusive and a frequently used term in Gerard's work. If a false attribution is suspected, an additional error-tolerance parameter can be used, which also admits terms that are used very rarely by other translators.
  2. In a second step, content words specific to the discipline of the text, such as substantia composita or horoscopus, have to be filtered out of this list. The remaining words are stylistic words in a narrower sense, i.e. words that can in principle be used in all scientific Latin texts of this period. These include not only conjunctions and other particles, but also words and phrases like examinatio, annullare or demonstrare voluimus. This focus is important, as experience has shown that content words are more easily adopted by other translators, whereas stylistic words and phrases tend to remain specific to a single author.
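
The search described in step 1 can be sketched in Python as follows. This is not the search engine used in the project, only a minimal illustration under stated assumptions: the corpus is a dictionary mapping each known translator to a dictionary of tokenized texts, and the names ngrams, exclusive_terms, min_count, min_share and tolerance are hypothetical. The filtering of content words in step 2 remains a manual, philological task and is not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """Overlapping sequences of n words, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def exclusive_terms(corpus, translator, n=2, min_count=10, min_share=0.4,
                    tolerance=0):
    """Terms (n-grams) used frequently by `translator` and by nobody else.

    corpus:    {translator: {text title: list of tokens}}
    min_count: minimum total number of occurrences in the translator's texts
    min_share: minimum share of the translator's texts containing the term
    tolerance: maximum number of occurrences allowed in other translators' texts
    Returns {term: (number of texts containing it, total occurrences)}.
    """
    own_texts = corpus[translator]
    total = Counter()       # occurrences across the translator's texts
    text_count = Counter()  # number of the translator's texts with the term
    for tokens in own_texts.values():
        grams = ngrams(tokens, n)
        total.update(grams)
        text_count.update(set(grams))

    foreign = Counter()     # occurrences in all other translators' texts
    for other, texts in corpus.items():
        if other != translator:
            for tokens in texts.values():
                foreign.update(ngrams(tokens, n))

    return {
        term: (k, total[term])
        for term, k in text_count.items()
        if total[term] >= min_count
        and k / len(own_texts) >= min_share
        and foreign[term] <= tolerance
    }
```

With the figures from the example above, iterum quia (4 of 10 texts, 56 occurrences, none elsewhere) would pass the default thresholds min_count=10 and min_share=0.4. Calling the function with n=1, 2, 3 and so on covers single words as well as longer phrases.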

Subsequently, one can record for each anonymously translated text in the corpus which of these words appear in it. If positive and negative evidence agree, i.e. if a number of words exclusive to a single translator appear in the text (positive) and at the same time no exclusive words of other translators do (negative), the attribution of the text to that translator is quite certain.
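
The bookkeeping of positive and negative evidence could look like the sketch below. It assumes that the filtered lists of exclusive terms are available as sets per translator and that the anonymous text has been normalized in the same way as the lists; the function names and the threshold min_hits are hypothetical, and the result is only a candidate that still needs philological judgment.

```python
def evidence(tokens, exclusive, max_n=3):
    """For each translator, list the exclusive terms found in the text.

    exclusive: {translator: set of exclusive words or phrases}
    max_n:     longest phrase length (in words) that is looked up
    """
    present = set()
    for n in range(1, max_n + 1):
        present.update(" ".join(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    return {translator: sorted(term for term in terms if term in present)
            for translator, terms in exclusive.items()}


def candidate(hits, min_hits=3):
    """Name a translator only if positive and negative evidence agree."""
    with_hits = [t for t, found in hits.items() if found]
    # Positive: enough hits for one translator. Negative: none for the others.
    if len(with_hits) == 1 and len(hits[with_hits[0]]) >= min_hits:
        return with_hits[0]
    return None


if __name__ == "__main__":
    exclusive = {
        "Gundisalvi": {"sic ut", "cuius comparatio", "opus fuit"},
        "Gerard": {"iterum quia"},
    }
    text = "et sic ut opus fuit cuius comparatio est".split()
    hits = evidence(text, exclusive)
    print(hits, "->", candidate(hits))
```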

For very short texts, it may be advisable to expand the analysis to less frequent words. In this case, however, the less frequent words of other translators have to be compared as well. Experience shows that only a large accumulation of these less typical words and phrases in an anonymously translated text allows for a credible attribution.

(II) Computerized Stylometry using Burrows's Delta

The second method is based on the ideas of John Burrows, who assumes that authorship can be identified by comparing the standardized relative frequencies of the most frequent words (MFW) in texts. The method has proven highly successful for computational authorship attribution. Many open-source implementations of this method can be found on the web. A user-friendly interface is provided by the Stylo package for R by Maciej Eder and Jan Rybicki. We used our own implementation in Python, based on Fotis Jannidis' pydelta. Usually, these implementations offer a choice between different distance measures or "Deltas", i.e. different methods for calculating the stylistic "distance" between two texts. Recent studies have shown that the so-called "Cosine Delta" is a particularly well-performing stylometric distance measure. We also obtained our best results with Cosine Delta.

In the first step, we analyzed the texts in the corpus with a known translator. The number of most frequent words (100, 200 or more) can be adjusted in most implementations of the method. We obtained our results with the 150 most frequent words in the texts. Each text of the corpus is processed as a vector containing the standardized relative frequencies of these words. The distance between these vectors is calculated using Cosine Delta. After that, the computer forms groups or clusters based on these distances, which can then be visualized in a dendrogram. Using this method, the computer was indeed able to sort the texts with known translators into groups according to these translators, i.e. one group for the translations by Dominicus Gundisalvi, one for Gerard of Cremona etc. Once this clustering succeeds, the method is calibrated, so to speak.
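
The vector construction and the distance measure can be illustrated with a short, library-free Python sketch. It is not the pydelta-based implementation used in the project. It assumes that the MFW list is ranked by raw counts over the whole corpus and that the frequencies are standardized with the sample standard deviation; both are common choices but assumptions here, and all function names are hypothetical. Clustering and the dendrogram are left to an existing package.

```python
import math
from collections import Counter


def most_frequent_words(texts, n_mfw=150):
    """The n_mfw most frequent words over all texts ({title: tokens})."""
    total = Counter()
    for tokens in texts.values():
        total.update(tokens)
    return [word for word, _ in total.most_common(n_mfw)]


def z_scores(texts, vocabulary):
    """Standardized relative frequencies: one vector per text."""
    names = list(texts)
    freqs = {}
    for name in names:
        counts = Counter(texts[name])
        freqs[name] = [counts[w] / len(texts[name]) for w in vocabulary]
    z = {name: [] for name in names}
    for i in range(len(vocabulary)):
        column = [freqs[name][i] for name in names]
        mean = sum(column) / len(column)
        # Sample standard deviation; requires at least two texts.
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
        for name in names:
            z[name].append((freqs[name][i] - mean) / std if std > 0 else 0.0)
    return z


def cosine_delta(u, v):
    """Cosine Delta: 1 minus the cosine of the angle between two z-score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def distance_matrix(texts, n_mfw=150):
    """Pairwise Cosine Delta between all texts, as {(title, title): distance}."""
    z = z_scores(texts, most_frequent_words(texts, n_mfw))
    return {(a, b): cosine_delta(z[a], z[b]) for a in texts for b in texts}
```

The resulting distances lie between 0 (same direction of the z-score vectors) and 2 (opposite direction) and can be passed to any hierarchical clustering routine to draw the dendrogram.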

In the second step, the anonymous translations are added to the system. The resulting dendrograms have to be interpreted carefully: if the Gundisalvi cluster (or the Gerard cluster etc.) remains stable and is merely expanded by additional anonymous translations, these texts were probably translated by Gundisalvi (or Gerard etc.). However, if the groups disperse, the computer is evidently unable to attribute the anonymous translations correctly.

Fortunately, the results of method I (exclusive words) and method II (MFW) mostly agreed in our experiments, at least for the philosophical corpus. The astronomical/astrological corpus, however, is not yet big enough for method II.