Data entry specs for Chinese text: version 2.0.1 (22nd June 2009)
Related papers
Automated entry system for Chinese printed documents
Image and Vision Computing, 1995
In this paper, we present a new automated entry system for Chinese printed documents. The system features automated text/graph segmentation and multi-font, multi-size printed Chinese character recognition. Experimental results show that 95.8-99.4% of the top 10 printed characters can be correctly recognized, at a speed of 0.16 seconds per character.
Some Experience in Text Processing in the Chinese Language
The Chinese language presents many difficulties in text processing. There are some 7,000 characters in routine use, and conventional approaches to keyboards, displays and printers are unable to cope with the set required. Yet the language is a very important one, since it is in daily use by one quarter of the population of the world. This paper describes a complete phototypesetting system recently developed for use with text in the Chinese and English languages and now in use for book printing in Beijing and Shanghai. Recent work on the application of a similar approach to data processing in Chinese is also outlined.
Resolving the unencoded character problem for Chinese digital libraries
Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries - JCDL '05, 2005
Constructing a Chinese digital library, especially one for archiving historical articles, is often hampered by the small character sets supported by current computer systems. This paper aims to resolve the unencoded character problem with a practical, composite approach for Chinese digital libraries. The proposed approach consists of a glyph expression model, a glyph structure database, and supporting tools. With this approach, the following problems can be resolved. First, the extensibility of Chinese characters can be preserved. Second, unencoded characters become as easy to generate, input, display, and search as existing ones. Third, the approach is compatible with the existing encoding schemes that most computers use. It has been used by organizations and projects in various application domains, including archeology, linguistics, ancient texts, calligraphy and paintings, and stone and bronze rubbings. For example, at Academia Sinica, a very large full-text database of ancient texts called Scripta Sinica has been created using this approach. The Union Catalog of the National Digital Archives Project (NDAP) dealt with the unencoded characters encountered when merging the metadata of 12 different thematic domains from various organizations. Also, in the Bronze Inscriptions Research Team (BIRT) of Academia Sinica, 3,459 Bronze Inscriptions were added, which is very helpful to education and research in historical linguistics.
Page Preparation 1. Mise-en-Page
2016
Script for video: https://youtu.be/qay82MW52k8
Let’s start with the “mise-en-page”, a fancy French expression that could be translated into English as “page layout”; for some reason, in codicology the French term has caught on and is preferred in almost every Western language. Anyhow, the mise-en-page is the result of fitting text into the available space. There are a number of considerations to take into account here, because the resulting object, the book, should have appropriate dimensions for its intended use, and at the same time the reader must have access to the text that is as comfortable as possible, not to mention that the book must keep within budget. In practice, the mise-en-page is the result of dividing the text into pages, and the text of each page into lines. The main constraint is, of course, the nature of the writing support, but this operation is also the result of a number of cultural conventions. That will be our subject for this video and the next one. In particular we shall see:
• The need for a mise-en-page to give potential readers good access to the text
• Which elements must be considered in defining a mise-en-page
• Which cultural constraints shape the design of the mise-en-page
• What value margins have in the mise-en-page
OCR for CJK Classical Texts -- Preliminary Examination
2000
The following is a proposal for a new means of separating a cursive string into separate and distinct characters for further analysis. The proposed method begins with some filtering: color filtering, noise reduction, conversion of the color image to a gray image, and binarization. Layout information, namely whether the text is written vertically or horizontally and the average character size in the text, is obtained from the analysis of a peripherally projected histogram. A character is constructed gradually from pixels. First, connected pixels are aggregated into small segments. Then neighboring segments are collected into a character or a cursive string. Finally, a cursive string is segmented essentially along the line connecting a concavity on one contour with a nearby concavity on the opposite contour. The strength of the new method is that it avoids the need for language-specific knowledge of character style and for layout information.
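The layout-analysis step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain row and column projection profiles on an already binarized page (1 = ink, 0 = background), and the function names and the decision rule (more blank rows than blank columns suggests horizontal writing) are assumptions made only for illustration.

```python
def projection_profiles(binary):
    """Return (row_profile, col_profile): ink-pixel counts per row and per column."""
    row_profile = [sum(row) for row in binary]
    col_profile = [sum(col) for col in zip(*binary)]
    return row_profile, col_profile


def ink_runs(profile):
    """Return the lengths of consecutive runs of non-empty entries in a profile."""
    runs, length = [], 0
    for value in profile:
        if value > 0:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs


def guess_layout(binary):
    """Guess the writing direction and the average character size in pixels.

    Assumed heuristic: text lines are separated by wider blank bands than
    the characters within a line, so the axis with more blank positions is
    taken to be the axis along which lines are stacked.
    """
    row_profile, col_profile = projection_profiles(binary)
    blank_rows = row_profile.count(0)
    blank_cols = col_profile.count(0)
    direction = "horizontal" if blank_rows >= blank_cols else "vertical"
    # The thickness of each text line approximates the character size.
    line_profile = row_profile if direction == "horizontal" else col_profile
    runs = ink_runs(line_profile)
    average_size = sum(runs) / len(runs) if runs else 0.0
    return direction, average_size
```

A real page would need the preceding filtering and binarization steps, and noisy scans rarely produce perfectly empty rows or columns, so a threshold on the profile values would likely replace the exact zero test used here.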
Standardisation in Manuscripts written in Sino-Arabic Scripts and xiaojing
Creating Standards, 2019
Standardisation processes concerning orthography, handwriting and page layout can be observed in manuscripts written in Sino-Arabic scripts that may or may not include transliterated Chinese-language texts (xiaojing). Besides identifying some of these processes, the objective of the present paper is to explore the xiaojing phenomenon with regard to name and script, earliest evidence, and its function as a system of writing Chinese. The material used for this investigation consists of trilingual manuscripts written in Arabic, Persian and Chinese, mostly produced in Northwest China in the context of Naqshbandiyya-based Sufism and higher education at the madrasas. Accordingly, the texts inscribed in the manuscripts relate mainly to Islamic mysticism and dogma, to prayer and philology. In the presentation of this material, different page-layout formats and configurations of languages will be looked at, and the conventions that have been followed in writing xiaojing will also be taken into consideration.
Vietnamese Text Extraction from Book Covers
Tạp chí Khoa học Đại học Đà Lạt, 2017
Automatic information extraction from images reduces cost, human intervention, and processing time. Converting printed book covers to readable text for later automated processing would be useful for a wide range of users, such as librarians, bookshop keepers, and individual users. In this paper, we present a novel method for Vietnamese text extraction from images of scanned book covers. The proposed system accepts a snapshot of a book cover, filters the input image to enhance its quality, locates the regions containing text, and then uses an optical character recognizer (OCR) to extract the text. The last step is to filter the extracted text against a dictionary to obtain the final text result. Experiments with the proposed system on our dataset delivered encouraging results.
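The final dictionary-filtering step can be sketched roughly as follows. The abstract does not say how the filter works, so this is only one plausible reading: each OCR token is matched against a word list with Python's standard `difflib`, close matches are corrected, and tokens with no close match are dropped. The function name `filter_with_dictionary` and the `cutoff` value are hypothetical.

```python
import difflib


def filter_with_dictionary(tokens, dictionary, cutoff=0.75):
    """Keep or correct OCR tokens that are close to a dictionary word.

    Matching is case-insensitive; tokens with no dictionary word at or
    above the similarity cutoff are treated as noise and discarded.
    """
    # Map lowercase forms back to the dictionary's own spelling.
    by_lowercase = {word.lower(): word for word in dictionary}
    candidates = list(by_lowercase)
    cleaned = []
    for token in tokens:
        key = token.lower()
        if key in by_lowercase:
            cleaned.append(by_lowercase[key])
            continue
        # get_close_matches returns the best matches first.
        matches = difflib.get_close_matches(key, candidates, n=1, cutoff=cutoff)
        if matches:
            cleaned.append(by_lowercase[matches[0]])
    return cleaned
```

For Vietnamese text the dictionary would hold words with their diacritics, so that a token recognized without its tone marks can be mapped back to the accented form; the cutoff would need tuning on real OCR output.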
2012
Digital editions have great potential for opening new avenues of research, but they also pose vexing research questions that have to be resolved adequately in order to make the resulting edition useful in the long run. One of the many differences between printed and digital editions of texts is the open-endedness of the latter, which means that a digital edition can be built incrementally and updated without incurring substantial expenses. The medium of digital editions requires the creator to make many assumptions about the texts explicit and to record them in a way that can be processed automatically. This is a new concept, which seems foreign to the agenda of a scholar whose ultimate aim is to engage with the text. This article demonstrates that what seems like a detour actually advances the understanding of the text, and that the need to objectify a text in this way gives access to new dimensions of the text. It then goes on to provide details of a conceptual model for describing a premodern text digit...