Annotation Research Papers - Academia.edu
We describe the development of web-based software that facilitates large-scale, crowdsourced image extraction and annotation within image-heavy corpora of interest to the digital humanities. An application of this software is then detailed and evaluated through a case study in which it was deployed on Amazon Mechanical Turk to extract and annotate faces from the archives of Time magazine. Annotation labels included categories such as age, gender, and race that were subsequently used to train machine learning models. The systematization of our crowdsourced data collection and worker quality verification procedures is detailed within this case study. We outline a data verification methodology that used validation images and required only two annotations per image to produce high-fidelity data with results comparable to methods using five annotations per image. Finally, we provide instructions for customizing our software to meet the needs of other studies, with the goal ...
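As a rough illustration of the verification idea described here, the sketch below gates workers on known-answer validation images and accepts a label only when a pair of trusted annotations agrees. The function names and the 0.8 threshold are our own assumptions, not the paper's published procedure or values.

```python
# A minimal sketch (not the authors' exact procedure) of gating crowd
# workers on gold validation images and reconciling two annotations per image.
from collections import Counter

VALIDATION_THRESHOLD = 0.8  # assumed pass rate on gold images

def worker_passes(worker_labels, gold_labels, threshold=VALIDATION_THRESHOLD):
    """Check a worker's accuracy on images with known (gold) answers."""
    checked = [img for img in gold_labels if img in worker_labels]
    if not checked:
        return False
    hits = sum(worker_labels[img] == gold_labels[img] for img in checked)
    return hits / len(checked) >= threshold

def reconcile(annotations):
    """Keep labels where the two trusted workers agree; flag the rest."""
    accepted, flagged = {}, []
    for img, labels in annotations.items():  # labels: list of two labels
        label, votes = Counter(labels).most_common(1)[0]
        if votes == len(labels):   # unanimous pair -> accept
            accepted[img] = label
        else:                      # disagreement -> route to an extra annotator
            flagged.append(img)
    return accepted, flagged
```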
Topic detection with large and noisy data collections such as social media must address both scalability and accuracy challenges. KeyGraph is an efficient method that improves on current solutions by considering keyword co-occurrence. We show that KeyGraph has accuracy similar to state-of-the-art approaches on small, well-annotated collections, and that it can successfully filter irrelevant documents and identify events in large and noisy social media collections. An extensive evaluation using Amazon’s Mechanical Turk demonstrated the increased accuracy and high precision of KeyGraph, as well as superior runtime performance compared to other solutions.
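The following is a much-simplified sketch of the co-occurrence intuition behind KeyGraph, not the published algorithm itself: keywords that repeatedly appear together form a graph whose dense components can be read as candidate topics. The example documents and the networkx-based pruning are ours.

```python
# Simplified co-occurrence topic detection: build a keyword co-occurrence
# graph, prune rare edges, and read topics off the connected components.
import itertools
import networkx as nx

def cooccurrence_topics(docs, min_weight=2):
    """docs: list of keyword lists, one per document."""
    g = nx.Graph()
    for keywords in docs:
        for a, b in itertools.combinations(sorted(set(keywords)), 2):
            w = g.get_edge_data(a, b, {}).get("weight", 0)
            g.add_edge(a, b, weight=w + 1)
    # drop keyword pairs seen in fewer than min_weight documents
    weak = [(a, b) for a, b, d in g.edges(data=True) if d["weight"] < min_weight]
    g.remove_edges_from(weak)
    g.remove_nodes_from(list(nx.isolates(g)))
    return [set(c) for c in nx.connected_components(g)]

docs = [["flood", "rescue", "river"], ["flood", "river", "rain"],
        ["match", "goal", "league"], ["goal", "league", "fans"]]
print(cooccurrence_topics(docs))  # two components: {flood, river}, {goal, league}
```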
Offline cursive script recognition and its associated issues remain open problems despite the last few decades of research. This paper presents an annotated comparison of the proposed and recently published preprocessing techniques with work reported in the offline cursive script recognition literature.
Normally, in offline script analysis, the input is a scanned image of a page, a word, or a digit, and the desired output is ASCII text. This task involves several preprocessing steps, some of them quite hard, such as line removal from text, skew removal, reference line detection (lower/upper baselines), slant removal, scaling, noise elimination, contour smoothing, and skeletonization. Moreover, the subsequent stages of segmentation (if any) and recognition are highly dependent on these preprocessing techniques. This paper presents an analysis and annotated comparison of the latest preprocessing techniques proposed by the authors with those reported in the literature on the IAM/CEDAR benchmark databases. Finally, future work and persistent problems are highlighted.
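To make one of the listed steps concrete, here is a hedged sketch of skew removal by projection-profile search, a standard approach in this literature; the paper itself compares several published techniques, and the parameter values below are purely illustrative.

```python
# Skew removal sketch: rotate the page over a range of angles and keep the
# angle that maximises the variance of the row-ink histogram (text lines
# produce the sharpest peaks when they are level).
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img, max_angle=10.0, step=0.5):
    """binary_img: 2-D array, text pixels = 1, background = 0."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # ink per row
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

def deskew(binary_img):
    return rotate(binary_img, estimate_skew(binary_img), reshape=False, order=0)
```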
We present a general model and information server for the digital annotation of printed documents. The resulting annotation framework supports both informal and structured annotations as well as context-dependent services. A demonstrator application for mammography that features both enhanced writing and reading activities is described.
The automatic annotation of medical images is a prerequisite for building comprehensive semantic archives that can be used to enhance evidence-based diagnosis, physician education, and biomedical research. Annotation also has important applications in the automatic generation of structured radiology reports. Much of the prior research work has focused on annotating images with properties such as the modality of the image, or the biological system or body region being imaged. However, many challenges remain for the annotation of high-level semantic content in medical images (e.g., presence of calcification, vessel obstruction, etc.) due to the difficulty of discovering relationships and associations between low-level image features and high-level semantic concepts. This difficulty is further compounded by the lack of labelled training data. In this paper, we present a method for the automatic semantic annotation of medical images that leverages techniques from content-based image retrieval.
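A minimal sketch of the retrieval-based labelling idea, assuming feature vectors already exist for the archive: a new image inherits the semantic labels that dominate among its nearest neighbours. The names and the majority-vote rule are our assumptions, not the paper's method.

```python
# Label transfer via content-based retrieval: find the k nearest archive
# images in feature space and keep labels supported by a majority of them.
import numpy as np
from collections import Counter

def annotate_by_retrieval(query_vec, archive_vecs, archive_labels, k=5):
    """archive_vecs: (n, d) feature matrix; archive_labels: list of n label sets."""
    dists = np.linalg.norm(archive_vecs - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(lbl for i in nearest for lbl in archive_labels[i])
    return [lbl for lbl, n in votes.items() if n > k // 2]
```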
Legal texts play an essential role in organisations, be they public or private, where each actor must be aware of, and comply with, regulations. However, because of the difficulties of the legal domain, actors prefer to rely on an expert rather than searching for the regulation in a collection of documents. In this paper, we use a rule-based approach built on the contextual exploration method for the semantic annotation of Algerian legal texts written in Arabic. We are interested in the specification of the semantic information of the provision types obligation, permission, and prohibition, and of the arguments role and action. A preliminary experiment showed promising results for the specification of provision types.
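For illustration only, the toy sketch below assigns provision types with surface keyword rules over English glosses; the paper's actual system applies the contextual exploration method to Arabic text, which these regular expressions do not attempt to reproduce.

```python
# Toy rule-based provision typing over English glosses.
import re

# ordered so that "must not" is caught before the bare "must" of obligation
RULES = [
    ("prohibition", re.compile(r"\b(must not|shall not|is prohibited from)\b", re.I)),
    ("obligation",  re.compile(r"\b(must|shall|is required to)\b", re.I)),
    ("permission",  re.compile(r"\b(may|is permitted to|is allowed to)\b", re.I)),
]

def provision_type(sentence):
    for label, pattern in RULES:
        if pattern.search(sentence):
            return label
    return "unspecified"

print(provision_type("The operator must not disclose the data."))  # prohibition
```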
The paper discusses the issues faced while evaluating the different aspects of mapping language-specific data onto the IMAGACT interface. The paper is a result of working on the IMAGACT project under the supervision of Prof. Girish Nath Jha of the School of Sanskrit and Indic Studies, Jawaharlal Nehru University. IMAGACT is an initiative of a consortium (namely the University of Florence, ILC-CNR Pisa, and the University of Siena) funded in Italy. It has been designed around corpus-based annotation by mother-tongue linguists. Beginning with English and Italian speech corpora, a substantial amount of variation of action-oriented lexicons across multiple action concepts was noticed.
The nature and extent of the demand for research-capable workers is a topic of intense concern locally and internationally. With around 60% of graduates in Australia finding employment outside of academia on graduation, PhD programs are under increasing pressure to be relevant to the contemporary workplace beyond the walls of the academy. However, as yet there is very little research on exactly what industry's needs are, as discussions with industry often result in recommendations based on anecdote rather than data. This study aims to fill this gap by analysing a large data set of job ads to see what employers outside academia really want from graduates.
New social annotation practices have the potential to become a “signature pedagogy” (Shulman 2005) for educators in literary studies, because social annotation encapsulates both the expected learning outcomes and the underlying value commitments of literature education. We give an account of a project conducted at the Education University of Hong Kong, during which colleagues explored social annotation technologies in literary studies courses. After implementing social annotation in our courses, instructors held roundtable discussions, collected surveys, and conducted focus group interviews. Basing our interpretation of these data on Louise Rosenblatt’s transactional theory of reading and writing, we propose that social annotation can help students engage with literary texts more effectively by showing them how to move toward an aesthetic mode of reading. Students participating in social annotation, moreover, understood its application to literary studies in ways that directly reproduced Rosenblatt’s account of literary interpretation.
- by Chris Bowers
- Multimedia, Video, Annotation
Zeyrek, D., Demirşahin, I., Sevdik-Çallı, A. B., & Çakıcı, R. (2013). Turkish Discourse Bank: Porting a discourse annotation style to a morphologically rich language. Dialogue & Discourse, 4(2), 174-184. This paper describes the current ...
This paper presents a project for the creation of an ontology-encoded version of the “Fontes” by Giuseppe Lugli, one of the most important collections of sources for the study of the topography of ancient Rome. Only seven volumes of the work were published by Lugli between 1952 and 1962; the publication of the remaining volumes is still in progress. The goal of the project is the creation of a semantic “ancient sources” GIS, a set of interactive maps of ancient Rome which will provide spatial information along with the descriptions recorded by ancient sources. The model chosen for the encoding is the CIDOC CRM, an international ISO standard developed to describe concepts and relationships used in cultural heritage documentation. The event-based CIDOC CRM ontology seems ideal to describe the topographic and historical layers designed by Lugli, as it will clarify the relations existing among the events recorded by the ancient sources, and the monuments and places to which they refer.
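To suggest what such an encoding can look like, here is a hedged rdflib sketch of one event-source-place cluster using common CIDOC CRM classes and properties (E5_Event, E53_Place, E31_Document, P7_took_place_at, P70_documents); the URIs are invented examples, and the project's actual mapping may differ.

```python
# Encoding "an ancient source documents an event at a place" in CIDOC CRM.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/fontes/")   # invented project namespace

g = Graph()
g.bind("crm", CRM)

# the event and the place where it took place
g.add((EX.dedication_templum_pacis, RDF.type, CRM.E5_Event))
g.add((EX.dedication_templum_pacis, RDFS.label,
       Literal("Dedication of the Templum Pacis")))
g.add((EX.templum_pacis, RDF.type, CRM.E53_Place))
g.add((EX.dedication_templum_pacis, CRM.P7_took_place_at, EX.templum_pacis))

# the ancient source (an E31 Document) that records the event
g.add((EX.josephus_bj_7_158, RDF.type, CRM.E31_Document))
g.add((EX.josephus_bj_7_158, CRM.P70_documents, EX.dedication_templum_pacis))

print(g.serialize(format="turtle"))
```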
This article explores ways of using digital methods for analyzing texts in academic teaching with the aim of promoting discussions about the theory and methodology of interpretation. In a first step, some relevant distinctions and topics from theoretical debates about interpretation are identified and explained (content-specifying vs. content-transcending interpretation; (interpretation) relativism vs. objective criteria for good interpretations; text description as a heuristic instrument vs. justification for interpretations). Next, the field of digital text analysis is introduced, with a distinction of two paradigms (distant vs. close reading) and some remarks on existing work on computer-aided methods in interpretation. Finally, two ideas for academic teaching are introduced. The first takes a close reading approach using collaborative manual annotation to prompt discussion of interpretative pluralism (mainly in connection with content-specifying interpretation). The second applies algorithmic distant reading methods to initiate discussions about the role of text description for (content-transcending) interpretation. Both ideas serve as praxis-based approaches to theoretical/methodological issues concerning interpretation and at the same time help to teach a well-reflected use of computer-aided methods for analyzing texts in the field of literary studies.
The paper describes the experience of the MCCA research group with regard to the interoperability of the Folker, ELAN and Praat computer programmes for multimodal linguistic annotation, describing the reasons for choosing them over other available software. Furthermore, from the point of view of users, the authors indicate possible (technical) solutions that could facilitate the work of linguistic annotators of multilingual data.
- Annotation, Ground Truth
This research is an annotated translation. The problems of this research are: (1) What difficulties did the researcher/translator encounter during the process of translation? (2) How were those difficulties solved? The aims of this research are: (1) to attain factual information concerning the difficulties faced by the researcher/translator while translating the source text, and (2) to find plausible solutions by referring to the principles of translation, the translation strategies, the theories of translation, and the theories of both the Indonesian and English languages. The methods of the research are: (1) the introspective and retrospective methods, and (2) purposive random sampling. From the 167 data items collected, the researcher purposefully selected the 45 most difficult items; from these 45, the researcher randomly chose 25 to be analysed. The results of this research are: (1) only five of the thirteen principles of translation were employed in this research, i.e. Meaning, Style and Clarity, Form, Register, and Idiom; (2) thirteen of the thirty translation strategies were applied in this research, i.e. Transposition (2 items), Unit Shift (2 items), Phrasal Verb (2 items), Idiom (2 items), Information Change (2 items), Loan (2 items), Calque (2 items), Explicitness Change (2 items), Expansion (2 items), Compression (2 items), Antonymy (2 items), Cohesion Change (1 item), and Clause Structure Change (2 items). The finding of this research is that not all of the thirteen principles of translation and thirty translation strategies were employed, because only twenty-five data items were analysed.
"The paper describes common principles for annotating communicative macro episodes in the ORD corpus of Russian everyday speech, which takes into account different types and conditions of spoken communication. The paper provides concise... more
"The paper describes common principles for annotating communicative macro episodes in the ORD corpus of Russian everyday speech, which takes into account different types and conditions of spoken communication. The paper provides concise statistical description of the ORD macro episodes (e.g., type/place of communication and speakers' social roles), and reveals the most common types of episodes presented in the ORD corpus.
В докладе представлены основные принципы аннотирования коммуникативных макроэпизодов, используемые в речевом корпусе «Один речевой день» и учитывающие тип и базовые условия повседневной речевой коммуникации; приведены некоторые результаты статистической обработки корпуса (распределение макроэпизодов по типу, месту коммуникации и социальным ролям) и выявлены наиболее типичные по данным параметрам макроэпизоды корпуса ОРД."
This paper outlines our experiences with applying collaborative tagging in e-learning systems to supplement more traditional metadata-gathering approaches. Over the last 10 years, the learning object paradigm has emerged in e-learning and has caused standards bodies to focus on creating metadata repositories based upon strict domain-free taxonomies. We argue that social collection phenomena and flexible metadata standards are key to collecting the kinds of metadata required for adaptable online learning. This paper takes a broad look at tagging within e-learning. It first looks at the implications of tagging within the domain through an analysis of tags students provided when classifying learning objects. Next, it looks at two case studies based on novel interfaces for applying tagging. These two systems emphasize tags being applied within learning content through the use of a highlighting metaphor.
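A minimal sketch of the flexible-metadata idea: free-form student tags stored per learning object and aggregated into a ranked folksonomy. The data structure and names are illustrative assumptions, not the systems described in the case studies.

```python
# A toy folksonomy store for learning objects.
from collections import defaultdict, Counter

class TagStore:
    def __init__(self):
        self._tags = defaultdict(list)          # object_id -> [tag, ...]

    def tag(self, user, object_id, *tags):
        self._tags[object_id].extend(t.strip().lower() for t in tags)

    def folksonomy(self, object_id, top=5):
        """Most common tags rise to the top; idiosyncratic ones fall away."""
        return Counter(self._tags[object_id]).most_common(top)

store = TagStore()
store.tag("alice", "lo-42", "recursion", "CS1")
store.tag("bob",   "lo-42", "recursion", "examples")
print(store.folksonomy("lo-42"))  # [('recursion', 2), ('cs1', 1), ('examples', 1)]
```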
We are living through a digital renaissance of the annotations that accompany texts. My thesis is that digital annotation, a tool of our individual cognition, is also a carrier of our communal knowledge. In this paper I first examine the annotation tools established in the Gutenberg galaxy and their role. Drawing on research into the annotation of printed texts, I reflect on the new digital hypertext and its possibilities. I analyse the Xanadu system design and the Web 2.0 paradigm as precursors of the model of digital annotation. The object of the study is to explore the problems and possibilities of knowledge services and annotation through the annotation systems of today's global content aggregators. The conclusion is that digital annotation is an existing, widespread, and actively used phenomenon; nevertheless, system-level changes are needed for it to become the foundation of our communal knowledge.
Close reading describes a set of procedures and methods that distinguishes the scholarly apprehension of textual material from the more prosaic reading practices of everyday life. Its origins and ancestry are rooted in the exegetical traditions of sacred texts (principally from the Hindu, Jewish, Buddhist, Christian, Zoroastrian, and Islamic traditions) as well as the philological strategies applied to classical works such as the Homeric epics in the Graeco-Roman tradition, or the Chinese 詩經 (Shijing) or Classic of Poetry. Cognate traditions of exegesis and commentary formed around Roman Law and the Canon Law of the Christian Church, and they also find expression in the long tradition of Chinese historical commentaries and exegeses on the Five Classics and Four Books. As these practices developed in the West, they were adapted to medieval and early modern literary texts, from which the early manifestations of modern secular literary analysis came into being in European and American universities. Close reading comprises the methodologies at the centre of literary scholarship as it developed in the modern academy over the past century or so, and has come to define a central set of practices that dominated scholarly work in English departments until the turn to literary and critical theory in the late nineteen-sixties. This essay provides an overview of these dominant forms of close reading in the modern Western academy. The focus rests upon close reading practices and their codification in university English departments, although reference is made to non-Western reading practices and philological traditions, as well as to significant non-Anglophone alternatives to the common understanding of literary close reading.
Alberto MONTANER, «Don Sancho de Azpetia, escudero vizcaíno (Quijote, I, VIII-IX)», Emblemata: Revista Aragonesa de Emblemática [ISSN 1137-1056], vol. X (2004), pp. 215-332.
Over the past three decades, the history of reading has become an increasingly lively field of scholarship. Important case studies have documented the freedom that individual readers have enjoyed in handling their books. On a structural level, however, the scholarship has been hampered by limited access to an inherently fragmented body of evidence. This article introduces a new research project, Annotated Books Online (ABO), which aims to provide a platform for the study of manuscript annotations in early modern printed books. ABO offers an open-access research environment where scholars and students can collect and view new evidence, as well as collaborate on transcriptions, translations, and new research initiatives. To illuminate the promising potential of new research on marginalia and adumbrate the challenges ahead, the second part of this article offers a case study of three intriguing annotated copies of Homer, once owned by the German reformer Philipp Melanchthon (Columbia University Library, Plimpton 880 1517 H37).
- by Arnoud Visser and +1
- Homer, Digital Humanities, Book History, Renaissance Studies
Discussing the study of translation as it relates to various disciplines, from comparative literature to philosophy and stylistics, in this paper I examine the philological and translatory practice of Vladimir Nabokov. Specifically, I discuss how translation and writing, for Nabokov, are inseparable from a specific philological engagement in annotation and commentary. This is further examined in the case of Nabokov’s two seminal works, his translation of Pushkin’s 'Eugene Onegin' and his novel 'Lolita.' By viewing these two works as exemplary texts which blur the line between translation and writing, or between analysis and synthesis, I further examine how Nabokov’s 'Lolita' and his self-translation of the novel into Russian are reflected in the Croatian literary context. A detailed comparison of the Croatian translation of 'Lolita' is provided, and a new, annotated translation of 'Lolita' is proposed. Lastly, I provide a comparative reading of Nabokov’s 'Lolita' and Antun Šoljan’s novel 'The Traitors' in order to illuminate a thus far unacknowledged correlation between the two writers and translators.
The process of transcribing and annotating non-manual features presents challenges for sign language researchers. This paper describes the approach used by our research team to integrate the Facial Action Coding System (FACS) with the EUDICO Linguistic Annotator (ELAN) program to allow us to more accurately and efficiently code non-manual features. Preliminary findings are presented which demonstrate that this approach is useful for a fuller description of facial expressions.
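A sketch of what pairing FACS codes with ELAN tiers can look like in script form, using the pympi library to write an EAF file; the tier names, action units, and timings are invented examples rather than the team's actual coding template.

```python
# Writing FACS action-unit codes into ELAN tiers with pympi (pympi-ling).
# Times are in milliseconds, as ELAN expects.
from pympi import Eaf

eaf = Eaf()  # new EAF document
eaf.add_tier("FACS_upper_face")
eaf.add_tier("FACS_lower_face")

# AU1+AU2 (inner + outer brow raiser) over a signed question
eaf.add_annotation("FACS_upper_face", 1200, 1850, "AU1+AU2")
eaf.add_annotation("FACS_lower_face", 1300, 1700, "AU26")  # jaw drop

eaf.to_file("signer01_facs.eaf")
```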
What exactly has changed in the production of secondary school English over the last decade? To provide one part of an answer to that question, this paper takes the practice of annotation, a defining activity of the subject English in the UK yet one seldom researched, and uses it as a device for uncovering aspects of changes in the subject. The theoretical approach is that of multimodal social semiotics with an historical perspective. A multimodal approach looks beyond language to all forms of communication (Jewitt, 2009; Kress, 2009). The approach used in this paper allows investigation of the interactions among changes in the social environment, policy, curriculum, technology, and student resources. We draw on illustrative examples from three research projects around subject English: the Gains and Losses Project, consisting of 100 textbooks largely from 1935 to the present day (Bezemer & Kress, 2009); case studies collected in 2000 for the Production of School English Project (Kress, Jewitt, Bourne, Franks, Hardcastle et al., 2005); and the Evaluation of Schools Whiteboard Expansion Project (Moss et al., 2007).
Since verbs are the most important grammatical category in a language, and actions, activities, and states are denoted with their help, the goal of this project is to study the different meanings of the verb "gereftan" and to show the relations between those meanings based on Fillmore's frame semantics (1979). The study was carried out by examining the syntactic and semantic distribution of the arguments of selected concepts of the Persian verb "gereftan", obtained from social networks such as Twitter and YouTube. The results show that "gereftan" has different meanings, such as "receiving", "buying", and "understanding", which are sub-branches of the "commercial transaction" frame. The study demonstrates that the verbs "receiving", "buying", and "understanding", despite their different concepts, share some core elements, and these common elements create the semantic relation between them.
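As a toy illustration of the frame-semantic claim, the sketch below registers two senses against a shared frame and recovers their common core elements; the frame and element names are our assumptions, not the study's annotation scheme.

```python
# Representing verb senses as instances of a shared semantic frame.
from dataclasses import dataclass, field

@dataclass
class FrameInstance:
    frame: str                       # e.g. "Commercial_transaction"
    sense: str                       # the reading of "gereftan"
    core_elements: dict = field(default_factory=dict)

buying = FrameInstance(
    frame="Commercial_transaction",
    sense="buying",
    core_elements={"Receiver": "man (I)", "Theme": "ketāb (book)",
                   "Money": "pul (money)"},
)
receiving = FrameInstance(
    frame="Commercial_transaction",
    sense="receiving",
    core_elements={"Receiver": "man (I)", "Theme": "nāme (letter)"},
)

# senses that share a frame share core elements, which links their meanings
shared = set(buying.core_elements) & set(receiving.core_elements)
print(shared)  # {'Receiver', 'Theme'}
```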
In this article, we present our experiences with, and the problems we came across while, working on a multilingual corpus of speech data (Polish and German) and conducting a pragmalinguistic and suprasegmental analysis of it. Furthermore, we present some reflections on the notions of parallelity and comparability in this context. Creating corpora of spoken language constitutes a great challenge for the researcher due to its elusive nature. Speech data can be accessed by the researcher either in the form of transcripts of audio/video recordings (according to the methods of multimodal analysis) or in the form of notes from speech interactions (according to the ethnographic method). The researcher who wants to collect data for specific purposes − for example, to investigate (im)politeness − has to create settings, a context of interaction, and a situation in which a given phenomenon can be elicited. The need for phonetic analysis makes it necessary to make audio or video recordings of the data. These need to be made in a recording studio in order to ensure quality suitable for such an analysis (e.g. one channel per speaker, no background noise). Participants in recording sessions do not behave as naturally as they would in a natural setting (i.e. without microphones or cameras). What is more, spoken language is characterised by phenomena that are typical of it alone when compared to written language. These include anacoluthons, corrections, repairs, hearer signals, speaker signals, particles, discourse markers, etc. − phenomena that are treated as communicative “disturbances” in written language but are fundamental in face-to-face interactions. Considering the above requirements, one can state that creating corpora of spoken language requires a completely different approach than creating corpora of written language. In this article, a bilingual (Polish and German) corpus of spoken language is presented. The corpus has been created as part of the MCCA: Multimodal Communication: Culturological Analysis project for the purposes of culturological and suprasegmental analysis and consists of three types of recordings: dyadic conversations, scripted monologues (where the participants were supposed to intone sentences in order to achieve a certain result), and extracts from TV talk shows. The recordings have been transcribed using the Folker programme and GAT2 (GesprächsAnalytisches Transkriptionssystem) conventions, annotated (by means of the ELAN programme), and phonetically analysed (using the Praat programme).
- by Silvia Bonacchi and +2
- Corpus Linguistics, Multimodality, Annotation, Transcription
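As a pointer to how the phonetic end of such a workflow can be scripted, here is a small sketch that pulls a mean F0 out of a recording through the praat-parselmouth library rather than the Praat GUI; the file name is a placeholder, and this is not part of the MCCA toolchain itself.

```python
# Scripting a basic suprasegmental measurement (mean F0) with parselmouth.
import parselmouth

snd = parselmouth.Sound("dyad01_speakerA.wav")   # one channel per speaker
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]           # Hz, 0 where unvoiced
voiced = f0[f0 > 0]
print(f"mean F0: {voiced.mean():.1f} Hz over {snd.duration:.2f} s")
```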
The present study, entitled “Linguistic Annotation of Malayalam Speech Corpus”, is a study of annotating Malayalam speech data collected from informants in the various districts of Kerala who are native speakers of Malayalam. The recorded data was annotated at the word level and the syllable level using the Praat software. The study serves as a basic guideline for annotators who are beginning to use Praat; a step-by-step analysis using the software is discussed in detail.
The same set of five questions was asked of every informant, each of whom answered in the way they commonly speak with peers. The questions were simple, so that the informants could answer easily without deep thought or preparation.
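As a rough sketch of the word- and syllable-level annotation described, the function below emits a two-tier Praat TextGrid in Praat's long text format; the intervals and the sample word are invented placeholders, not data from the study.

```python
# Emit a minimal two-tier Praat TextGrid (long text format).
def textgrid(duration, tiers):
    """tiers: {name: [(start, end, label), ...]} -> TextGrid file text."""
    out = ['File type = "ooTextFile"', 'Object class = "TextGrid"', "",
           "xmin = 0", f"xmax = {duration}", "tiers? <exists>",
           f"size = {len(tiers)}", "item []:"]
    for i, (name, ivs) in enumerate(tiers.items(), 1):
        out += [f"    item [{i}]:", '        class = "IntervalTier"',
                f'        name = "{name}"', "        xmin = 0",
                f"        xmax = {duration}",
                f"        intervals: size = {len(ivs)}"]
        for j, (a, b, lab) in enumerate(ivs, 1):
            out += [f"        intervals [{j}]:", f"            xmin = {a}",
                    f"            xmax = {b}", f'            text = "{lab}"']
    return "\n".join(out)

tg = textgrid(1.0, {
    "word":     [(0.0, 1.0, "amma")],
    "syllable": [(0.0, 0.45, "am"), (0.45, 1.0, "ma")],
})
open("sample.TextGrid", "w", encoding="utf-8").write(tg)
```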
IntEnz is the name of the Integrated relational Enzyme database and is the official version of the Enzyme Nomenclature. The Enzyme Nomenclature comprises recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC‐IUBMB) on the nomenclature and classification of enzyme‐catalysed reactions. IntEnz is supported by the NC‐IUBMB and contains enzyme data curated and approved by this committee. The IntEnz database is available at http://www.ebi.ac.uk/intenz.
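For readers unfamiliar with the nomenclature IntEnz curates, the sketch below parses the four-level structure of an EC number (class, subclass, sub-subclass, serial); the class-name table is a stub we supply for illustration, not IntEnz data.

```python
# Parse an EC number into its four hierarchical levels.
EC_CLASSES = {1: "Oxidoreductases", 2: "Transferases", 3: "Hydrolases",
              4: "Lyases", 5: "Isomerases", 6: "Ligases", 7: "Translocases"}

def parse_ec(ec):
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"not a complete EC number: {ec!r}")
    c, sub, subsub, serial = (int(p) for p in parts)
    return {"class": EC_CLASSES[c], "subclass": sub,
            "sub-subclass": subsub, "serial": serial}

print(parse_ec("1.1.1.1"))  # alcohol dehydrogenase: an oxidoreductase
```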
A set of guidelines for the manual annotation of nominal utterances (or verbless sentences) in a corpus of informal dialogic texts downloaded from the web.
These guidelines take into account the variety of forms of nominal utterances and the substandard nature of the annotated texts.
The instructions were applied during the annotation of the COSMIANU corpus (freely downloadable at this link: https://hlt-nlp.fbk.eu/technologies/cosmianu1.0), presented at CLiC-it 2018.