XML Annotation of Hebrew Elements in Judeo-Arabic Texts
Related papers
Developing an XML-based, exploitable linguistic database of the Hebrew text of Gen. 1:1-2:3
University of Pretoria Electronic Theses and Dissertations, 2008
The thesis discusses a series of related techniques that prepare and transform raw linguistic data for advanced processing in order to unveil hidden grammatical patterns. A three-dimensional array is identified as a suitable data structure to build a data cube to capture multidimensional linguistic data in a computer's temporary storage facility. It also enables online analytical processing, like slicing, to be executed on this data cube in order to reveal various subsets and presentations of the data. XML is investigated as a suitable mark-up language to permanently store such an exploitable databank of Biblical Hebrew linguistic data. This concept is illustrated by tagging a phonetic transcription of Genesis 1:1-2:3 on various linguistic levels and manipulating this databank. Transferring the data set between an XML file and a three-dimensional array creates a stable environment allowing editing and advanced processing of the data in order to confirm existing knowledge or to mine for new, yet undiscovered, linguistic features. Two experiments are executed to demonstrate possible text-mining procedures. Finally, visualisation is discussed as a technique that enhances interaction between the human researcher and the computerised technologies supporting the process of knowledge creation. Although the data set is very small, there are exciting indications that the compilation and analysis of aggregate linguistic data may assist linguists to perform rigorous research, for example regarding the definitions of semantic functions and the mapping of these functions onto the syntactic module.
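The data cube and slicing idea described in this abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the thesis's own implementation: the axis layout (clauses by phrases by linguistic levels), the level names in `LEVELS`, and all cell values are hypothetical placeholders.

```python
# Hypothetical axis of linguistic levels; the thesis may use different ones.
LEVELS = ["text", "word_class", "syntactic_function"]

# cube[clause][phrase][level]: a three-dimensional array built from nested lists.
# All values are illustrative placeholders, not data taken from the thesis.
cube = [
    [  # clause 0
        ["phrase_a", "PP", "Adjunct"],
        ["phrase_b", "VP", "Main verb"],
        ["phrase_c", "NP", "Subject"],
    ],
    [  # clause 1
        ["phrase_d", "NP", "Subject"],
        ["phrase_e", "VP", "Copulative verb"],
        ["phrase_f", "NP", "Complement"],
    ],
]


def slice_level(cube, level_name):
    """Return a 2D slice holding one level's value for every (clause, phrase)."""
    k = LEVELS.index(level_name)
    return [[phrase[k] for phrase in clause] for clause in cube]


def slice_clause(cube, clause_index):
    """Return a 2D slice holding every level of every phrase in one clause."""
    return cube[clause_index]


word_classes = slice_level(cube, "word_class")
first_clause = slice_clause(cube, 0)
```

Fixing one axis and reading the remaining two is the slicing operation the abstract mentions: `slice_level` yields a clause-by-phrase table for a single linguistic level, while `slice_clause` yields a phrase-by-level table for a single clause.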
The Linguistic Annotation Framework (LAF) provides a general, extensible stand-off markup system for corpora. This paper discusses LAF-Fabric, a new tool to analyse LAF resources in general with an extension to process the Hebrew Bible in particular. We first walk through the history of the Hebrew Bible as text database in decennium-wide steps. Then we describe how LAF-Fabric may serve as an analysis tool for this corpus. Finally, we describe three analytic projects/workflows that benefit from the new LAF representation: 1) the study of linguistic variation: extract co-occurrence data of common nouns between the books of the Bible (Martijn Naaijer); 2) the study of the grammar of Hebrew poetry in the Psalms: extract clause typology (Gino Kalkman); 3) construction of a parser of classical Hebrew by Data Oriented Parsing: generate tree structures from the database (Andreas van Cranenburgh).
Arabic Codes in Hebrew Texts: On the Typology of Literary Code-switching
2016
In the late 1950s, Iraqi Jews were either forced or chose to leave Iraq for Israel. Most Iraqi Jewish authors found it impossible to continue writing in Arabic in Israel and so faced the literary challenge of switching to Hebrew. As bilinguals, Iraqi Jewish novelists have employed Arabic in some of their Hebrew literary works, including strategies of code-switching. Conversational code-switching is traditionally divided into three types: intersentential code-switching, intrasentential code-switching, and tag-switching. Although code-switching in literary texts has its distinct features, research on written code-switching generally follows the typology applied to conversational code-switching. This article focuses on the typology of code-switching in literary texts. It investigates Arabic codes used in three Hebrew novels written by Iraqi Jewish novelists. The article suggests three main types of literary code-switching in view of the mutual relationship between author, text, and reader: Hard-Access, Easy-Access, and Ambiguous Access code-switching.
Code-switching in Judaeo-Arabic documents from the Cairo Geniza
Multilingua, 2017
This paper investigates code-switching and script-switching in medieval documents from the Cairo Geniza, written in Judaeo-Arabic (Arabic in Hebrew script), Hebrew, Arabic and Aramaic. Legal documents regularly show a macaronic style of Judaeo-Arabic, Aramaic and Hebrew, while in letters code-switching from Judaeo-Arabic to Hebrew is tied in with various socio-linguistic circumstances and indicates how markedly Jewish the sort of text is. Merchants in particular employed a style of writing devoid of Hebrew elements, whereas community dignitaries are much more prone to mixing of Hebrew and Judaeo-Arabic (and Arabic), although the degree of mixing also depends on a number of other factors, such as on the individual education. Analyses show great variation within the repertoire of single authors, as shown on the example of Daniel b. ʿAzariah, according to the purpose of the correspondence, with religious affairs attracting the highest Hebrew content, whereas letters pertaining to trade...
TAJA Corpus: Linguistically Tagged Written Algerian Judeo-Arabic Corpus
Journal of Jewish Languages 10 (1): 24-53, 2022
The Tagged Algerian Judeo-Arabic (TAJA) corpus is the first linguistically annotated corpus of any Judeo-Arabic dialect regardless of geography and period. The corpus is a genre-diverse collection of written Modern Algerian Judeo-Arabic texts, encompassing translations of the Bible and of liturgical texts, commentaries and original Judeo-Arabic books and journals. The TAJA corpus was manually annotated with parts-of-speech (POS) tags and detailed morphology tags. The goal of the new corpus is twofold. First, it preserves this endangered Judeo-Arabic language, expanding on previous fieldwork and going beyond the study of individual written texts. The corpus has already enabled us to make strides towards a grammar of written Algerian Judeo-Arabic. Second, this tagged corpus serves as a foundation for the development of Judeo-Arabic-specific Natural Language Processing (NLP) tools, which allow automatic POS tagging and morphological annotation of large collections of yet untapped texts in Algerian Judeo-Arabic and other Judeo-Arabic varieties.
Round-tripping Biblical Hebrew linguistic data
Proceedings of 2007 Information Resources Management Association, International Conference, 2007
In processing language electronically, one can either concentrate on the digital simulation of human understanding and language production, or on the most appropriate way of storing and using existing knowledge. Both are valid and important. This paper falls in the second category, by assuming that it is useful to capture the results of linguistic analyses in well-designed, exploitable, electronic databanks. The paper focuses on the conversion of linguistic data of Genesis 1 between an XML data cube and a multidimensional array structure in Visual Basic 6 in order to facilitate data access and manipulation.
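The round trip between an XML file and a multidimensional array might look roughly like the sketch below. The paper works in Visual Basic 6; this illustration swaps in Python and its standard `xml.etree.ElementTree` module. The element names (`cube`, `clause`, `phrase`), the level names in `LEVELS`, and the sample values are assumptions made for the example, not the paper's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical linguistic levels stored for each phrase.
LEVELS = ["text", "word_class", "function"]


def cube_to_xml(cube):
    """Serialise cube[clause][phrase][level] into an XML string."""
    root = ET.Element("cube")
    for i, clause in enumerate(cube):
        clause_el = ET.SubElement(root, "clause", n=str(i))
        for j, phrase in enumerate(clause):
            phrase_el = ET.SubElement(clause_el, "phrase", n=str(j))
            for name, value in zip(LEVELS, phrase):
                ET.SubElement(phrase_el, name).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_cube(xml_text):
    """Rebuild the nested-list array from the XML produced by cube_to_xml."""
    root = ET.fromstring(xml_text)
    return [
        [
            [phrase.findtext(name, default="") for name in LEVELS]
            for phrase in clause.findall("phrase")
        ]
        for clause in root.findall("clause")
    ]


# Illustrative placeholder data, not taken from the paper.
sample = [
    [["phrase_a", "NP", "Subject"]],
    [["phrase_b", "VP", "Verb"], ["phrase_c", "NP", "Object"]],
]
restored = xml_to_cube(cube_to_xml(sample))
```

If `restored` equals `sample`, the conversion loses no information in either direction, which is the property the title's "round-tripping" refers to.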
2015
Biblical Hebrew databases and grammars are not a novelty: numerous medieval treatises deal with grammatical features of the Hebrew Bible, providing statistics as to the number of occurrences of a given phenomenon. This can already be seen in the marginal notes that accompany the biblical text on Masoretic manuscripts. The development of computer sciences in the twentieth century has paved the way for the creation of extensive computer databases of the Hebrew Bible, starting with the text itself — usually that of the Leningrad Codex rather than an eclectic edition or a text with critical apparatus. Lemmatisation enhances the textual database by identifying the various forms of a given lemma, thus enabling the user to perform lexicological queries. Morphological analysis encodes such features as part of speech, person, gender, number, state, aspect, and so on. The user is then able to search for all occurrences of a given pattern.
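Searching a lemmatised, morphologically tagged text for a given pattern can be illustrated with a small sketch. The `Token` fields, the `search` helper, and every record below are hypothetical placeholders chosen for the example; real databases of the Hebrew Bible use their own, much richer feature sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One word occurrence with its lemma and a few morphological features."""
    ref: str          # verse reference (placeholder values below)
    form: str         # surface form as it appears in the text
    lemma: str        # dictionary form shared by all inflected forms
    pos: str          # part of speech
    person: str = ""
    gender: str = ""
    number: str = ""


# Illustrative placeholder records, not real corpus data.
TOKENS = [
    Token("ref_1", "form_1", "lemma_a", "verb", person="3", gender="m", number="sg"),
    Token("ref_1", "form_2", "lemma_b", "noun", gender="m", number="pl"),
    Token("ref_2", "form_3", "lemma_a", "verb", person="3", gender="f", number="sg"),
    Token("ref_3", "form_4", "lemma_c", "noun", gender="f", number="sg"),
]


def search(tokens, **features):
    """Return the tokens whose fields match every requested feature value."""
    return [
        t for t in tokens
        if all(getattr(t, name) == value for name, value in features.items())
    ]


# Lexicological query: every form of one lemma.
forms_of_lemma_a = [t.form for t in search(TOKENS, lemma="lemma_a")]

# Morphological query: every third-person singular verb.
verbs_3sg = [t.ref for t in search(TOKENS, pos="verb", person="3", number="sg")]
```

Lemmatisation is what makes the first query possible, since differently inflected forms share a single `lemma` value; the second query relies only on the encoded morphological features.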