Digitization of Classical Indian Texts as a Part of Digital Humanities for Academic and Commercial Applications
Related papers
Language Technology and Digitization of Ancient Records in Indian Local Scripts
2016
The paper presents the aims and status of digitization of ancient records in Indian local scripts. Digital versions of recorded knowledge are important cultural and economic resources. Stored in efficient and accessible digital systems, they can enable preservation and wider distribution of our knowledge heritage through the World Wide Web. Digitized records allow quick search for phrases, words, and combinations of words in any record, provided appropriate language technology is used. Many projects currently active worldwide attempt to put ancient texts on different subject areas into electronic form. Taking six digitization projects in Tamil and Malayalam as samples, this study evaluates the status of digital libraries in Indian languages and discusses their problems and possible solutions. The paper stresses the need to give priority to the development of digital library packages that can process Indian local scripts so that digitization projects can be fruitful.
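Phrase search of the kind described above depends on consistent Unicode handling of the scripts involved. The sketch below is not from the paper; it is a minimal illustration of normalization-aware search over a toy set of Malayalam records, and the record strings and query are invented for the example.

```python
# Minimal sketch (not from the paper): Unicode-aware phrase search over
# digitized Indic-script records. The records and the query are hypothetical.
import unicodedata

def normalize(text: str) -> str:
    # NFC normalization collapses alternative encodings of the same
    # Tamil/Malayalam syllable so that matching is consistent.
    return unicodedata.normalize("NFC", text)

def search(records: list[str], phrase: str) -> list[int]:
    """Return indices of records containing the normalized phrase."""
    query = normalize(phrase)
    return [i for i, rec in enumerate(records) if query in normalize(rec)]

# Usage with a toy Malayalam record list (illustrative only).
records = ["കേരളത്തിലെ താളിയോല ഗ്രന്ഥങ്ങൾ", "പഴയ രേഖകളുടെ ഡിജിറ്റൽ ശേഖരം"]
print(search(records, "താളിയോല"))   # -> [0]
```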
Engaging with an Indian Epic: A Digital Approach
International Journal of Computer Applications
India's heritage texts have had a long history of being mined for knowledge of language and culture by Christian missionaries to India, colonial officers of the East India Company and the British Raj, German and other European and American Indologists, and later by native scholars driven by nationalist sentiments. It was during their investigative exercises that a vast body of India's heritage texts was recovered and made the subject of rigorous study. A large number of editions in English translation as well as in modern Indian vernacular languages started appearing on the scene. The focus then was primarily on patthoddhar [retrieval of the 'ur'-text] or making shuddhasanskarana [a correct edition]. The exercise was purely manual and time-consuming and concentrated on a limited number of texts. But there still lies a vast treasure of ancient knowledge in India's palm-leaf manuscripts, waiting to be discovered, deciphered and interpreted for contemporary readers and scholars. It is impossible to ignore the ubiquity of Information Technology-based tools and the scope they offer for large-scale data mining. Of late, a large body of historical texts is being made available digitally by repositories and institutions worldwide. The time is ripe for digitally inspired editions, beginning with studies in corpus linguistics. This paper throws light on the challenges to be addressed in preparing a digital historical corpus edition of the Sarala Mahabharata, a local version of the famous Sanskrit Mahabharata of Vyasa, from Odisha in eastern India.
Review of Advances in Digital Recognition of Indian Language Manuscripts
International Journal of Scientific Research in Science, Engineering and Technology, 2018
Digital content creation and document management in Indian languages are still at a developing stage. OCR has become an administrative requirement for effective governance and daily activities. Scripts from the medieval period to the contemporary era are of literary and political importance. The present research initiatives highlight the importance of, and need for, efforts in recognizing printed and handwritten documents written in languages of Indian origin. This paper aims to review the state of the various scripts in use, from the medieval period to the present era, and explores the prospects of digital recognition of handwritten and printed texts, thereby pointing towards future trends in developing restoration software for Indic scripts. While OCRs for Indic scripts like Devanagari have attained good results and continue to improve in accuracy, many medieval and ancient scripts have seen very few attempts. The challenge is due to the number of languages and their diverse scripts…
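As an illustration of the kind of printed-text recognition such reviews survey, the snippet below runs an off-the-shelf OCR engine with an Indic language model; it is not one of the systems reviewed, and the input file path is hypothetical.

```python
# Illustrative only: running an off-the-shelf OCR engine on a scanned page
# in an Indic script. Assumes Tesseract with the relevant traineddata
# (e.g. 'hin' for Devanagari) is installed; the file path is hypothetical.
from PIL import Image
import pytesseract

page = Image.open("scanned_page.png")                  # hypothetical input scan
text = pytesseract.image_to_string(page, lang="hin")  # Devanagari language model
print(text)
```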
De Gruyter eBooks, 2019
Cataloging and Citing Greek and Latin Authors and Works illustrates not only how Classicists have built upon larger standards and data models such as the Functional Requirements for Bibliographic Records (FRBR, allowing us to represent different versions of a text) and the Text Encoding Initiative (TEI) Guidelines for XML encoding of source texts (representing the logical structure of sources), but also highlights some major contributions from Classics. Alison Babeu, Digital Librarian at Perseus, describes a new form of catalog for Greek and Latin works that exploits the FRBR data model to represent the many versions of our sources, including translations. Christopher Blackwell and Neel Smith built on FRBR to develop the Canonical Text Services (CTS) data model as part of the CITE Architecture. CTS provides an explicit framework within which we can address any substring in any version of a text, allowing us to create annotations that can be maintained for years and even for generations. This addresses, at least within the limited space of textual data, a problem that has plagued hypertext systems since the 1970s and that still afflicts the World Wide Web. Those who read these papers years from now will surely find that many of the URLs in the citations no longer function, but all of the CTS citations should be usable, whether we remain with this data model or replace it with something more expressive. Computer scientists Jochen Tiepmar and Gerhard Heyer show how they were able to develop a CTS server that could scale to more than a billion words, thus establishing the practical nature of the CTS protocol. If there were a Nobel Prize for Classics, my nominations would go to Blackwell and Smith for CITE/CTS and to Bruce Robertson, whose paper on Optical Character Recognition opens the section on Data Entry, Collection, and Analysis for Classical Philology. Robertson has worked for a decade, with funding and without, on the absolutely essential problem of converting images of print Greek into machine-readable text. In this effort, he has mastered a wide range of techniques drawn from areas such as human-computer interaction, statistical analysis, and machine learning. We can now acquire billions of words of Ancient Greek from printed sources and not just from multiple editions of individual works (allowing us not only to trace the development of our texts over time but also to identify quotations of Greek texts in articles and books, and thus to see which passages are studied by different scholarly communities at different times). He has enabled fundamental new work on Greek. Meanwhile, the papers by Tauber, Burns, and Coffee address, respectively, the representation of characters, a pipeline for textual analysis of Classical languages, and a system that detects where one text alludes to, without extensively quoting, another text. At its base, philology depends upon the editions which provide information about our source texts, including variant readings, a proposed reconstruction of the original, and the reasoning behind decisions made in analyzing the text.
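The CTS citations mentioned above follow a fixed URN shape (namespace, text group, work, version, passage). The parser below is only an illustration of how such a citation decomposes; it is not part of the CITE/CTS reference implementations, and it assumes a fully qualified URN.

```python
# Sketch of how a Canonical Text Services (CTS) URN decomposes; this parser
# is our own illustration, not part of the CITE/CTS reference software, and
# it assumes the version and passage components are both present.
from dataclasses import dataclass

@dataclass
class CtsUrn:
    namespace: str   # e.g. 'greekLit'
    textgroup: str   # e.g. 'tlg0012' (Homer)
    work: str        # e.g. 'tlg001' (Iliad)
    version: str     # e.g. 'perseus-grc2' (a specific edition)
    passage: str     # e.g. '1.1-1.10' (book 1, lines 1-10)

def parse_cts_urn(urn: str) -> CtsUrn:
    _, _, namespace, work_part, passage = urn.split(":", 4)
    textgroup, work, version = work_part.split(".", 2)
    return CtsUrn(namespace, textgroup, work, version, passage)

# The same passage can be cited stably across editions and translations:
print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc2:1.1-1.10"))
```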
2022
This paper describes additional aspects of a digital tool called the 'Textual History Tool'. We describe its various salient features, with special reference to those that may help the philologist digitize commentaries and sub-commentaries on a text. The tool captures the historical evolution of a text through various temporal stages, along with interrelated data culled from various types of related texts. We use the text of the Kāśikāvṛtti (KV) as a sample, and with the help of philologists we digitize the commentaries available to us. We digitize the Nyāsa (Ny) and the Padamañjarī (Pm), and the sub-commentaries on the KV known as the Tantrapradīpa (Tp) and the Makaranda (Mk). We divide each commentary and sub-commentary into functional units and describe the methodology and motivation behind this functional unit division. Our functional unit division helps generate more accurate phylogenetic trees for the text, based on distance methods using ...
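The abstract does not spell out the distance computation, so the following is only a generic sketch of a distance-method tree: a rough textual distance between corresponding functional units of hypothetical witnesses, grouped by hierarchical clustering. It does not reproduce the Textual History Tool's actual pipeline.

```python
# Generic sketch of a distance-method tree over witnesses/commentaries;
# the actual Textual History Tool pipeline is not reproduced here.
from difflib import SequenceMatcher
from itertools import combinations
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical functional units (one short unit per witness, for illustration).
witnesses = {
    "KV": "vṛddhir ādaic iti saṃjñā kriyate",
    "Ny": "vṛddhir ādaij iti saṃjñā vidhīyate",
    "Pm": "vṛddhir ādaic iti saṃjñāvidhānam",
}
names = list(witnesses)

def dist(a: str, b: str) -> float:
    # 1 - similarity ratio as a rough textual distance between two units.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# Condensed pairwise distance vector in the order scipy expects.
condensed = [dist(witnesses[x], witnesses[y]) for x, y in combinations(names, 2)]
tree = linkage(condensed, method="average")   # UPGMA-style clustering
print(dendrogram(tree, labels=names, no_plot=True)["ivl"])
```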
Wrecked Indian Fonts: A Problem for Digitalization of Indic Documents
At the start of the 1990s, India saw a great boost in the world of information science. This era introduced India to modern computer systems. All the big industries and business houses started using computers instead of manpower for their day-to-day work. With time, the Indian scholarly and library community was also influenced to a great extent by the evolution of computers. Indian scholarly articles and books began to be published digitally in Indic scripts over the internet for open public access. Fonts were the key element of digital publishing. In the early days, little attention was paid to a standard format for these Indian-language fonts. This resulted in the publication of a large body of scholarly material with no semantic meaning for machine understanding, as legacy documents suffer information loss when fonts are substituted during migration. This paper helps in understanding the cause of this information loss during Indic font substitution by stating the backend problem of these legacy Indic fonts in digital documents.
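The information loss arises because legacy 8-bit fonts store glyph shapes in visual order rather than Unicode characters in logical order. The toy conversion below uses invented legacy codepoints, not those of any real font; it only illustrates the kind of reordering (a pre-positioned i-matra) that naive glyph-for-character substitution misses.

```python
# Toy sketch of why naive glyph-for-character substitution loses information.
# The legacy codepoint values below are hypothetical, not from any real font;
# the reordering issue (i-matra stored visually before its consonant) is the
# kind of backend problem the paper describes for legacy Devanagari fonts.

LEGACY_TO_UNICODE = {          # hypothetical 8-bit glyph codes -> Unicode
    0x66: "\u093f",            # short i matra (stored in visual position)
    0x6B: "\u0915",            # ka
    0x72: "\u0930",            # ra
}

def convert(legacy_bytes: bytes) -> str:
    out, pending_matra = [], None
    for b in legacy_bytes:
        ch = LEGACY_TO_UNICODE.get(b, "?")
        if ch == "\u093f":          # hold the pre-positioned matra ...
            pending_matra = ch
        else:
            out.append(ch)          # ... and emit it after its consonant,
            if pending_matra:       # as Unicode's logical order requires
                out.append(pending_matra)
                pending_matra = None
    return "".join(out)

print(convert(bytes([0x66, 0x6B, 0x72])))   # -> 'किर', not the broken 'िकर'
```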
Effect Of Pre-Processing On Historical Sanskrit Text Documents
In this paper, the effect of pre-processing on binarization is explored. Pre-processing and binarization operations are performed on a historical Sanskrit text document. After scanning, pre-processing operations are applied to the image to remove noise. Pre-processing techniques play an important role in binarization. The newly developed pre-processing techniques used here are Non-Local Means and Total Variation methods. Total Variation methods are motivated by developments in compressive sensing, such as ℓ1 optimization. Binarization is used as a pre-processor before OCR, because most OCR packages work only on black-and-white images.
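A pipeline of this kind can be approximated with standard image-processing libraries. The sketch below is only such an approximation: the filter parameters and file names are illustrative and are not taken from the paper.

```python
# Rough approximation of a denoise-then-binarize pipeline with standard
# libraries; parameters and file names are illustrative, not the paper's.
import cv2
from skimage.restoration import denoise_tv_chambolle
from skimage.util import img_as_ubyte

gray = cv2.imread("sanskrit_leaf.png", cv2.IMREAD_GRAYSCALE)   # hypothetical scan

# Non-Local Means denoising (filter strength h chosen ad hoc).
nlm = cv2.fastNlMeansDenoising(gray, None, h=15,
                               templateWindowSize=7, searchWindowSize=21)

# Total Variation denoising as an alternative pre-processor.
tv = img_as_ubyte(denoise_tv_chambolle(gray, weight=0.1))

# Otsu binarization yields the black-and-white image most OCR engines expect.
_, binary = cv2.threshold(nlm, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("binarized.png", binary)
```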
Journal of the Text Encoding Initiative, 2020
As a student of intellectual, religious, and cultural developments in areas of the Chinese cultural sphere, my initial motivation for engaging with digital texts thirty years ago was to open up the new possibilities that the digital medium offered to researchers, without losing any of the affordances of a traditional printed edition. This requirement includes use of texts for reading, translating, annotating, quoting, and publishing, thus integrating with the whole of the scholarly workflow. At that time theories of electronic texts started to appear and the Text Encoding Initiative had already begun to create a common text model and interchange specification, based mainly on European languages. For East Asian texts, things were much more complicated because of different and quickly evolving character encoding standards, different textual traditions and approaches to text editing, as well as different institutional embedding.