Constructing and validating readability models: the method of integrating multilevel linguistic features with machine learning
Related papers
Use of a New Set of Linguistic Features to Improve Automatic Assessment of Text Readability
2012
The present paper proposes and evaluates a readability assessment method designed for Japanese learners of EFL (English as a foreign language). The method is built with a regression algorithm over a new set of linguistic features that previous studies had employed only separately. The results show that the proposed method, which uses all of these linguistic features together, yields a lower assessment error than methods using only subsets of them.
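As an illustration only (not the authors' implementation), the comparison the abstract describes can be sketched as fitting one regression per feature group and one on the concatenation; the feature names and data below are invented placeholders.

```python
# Hypothetical sketch: regression over combined linguistic feature groups,
# compared against models trained on each group alone (names/data invented).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_texts = 200

# Three feature groups that earlier studies used separately (illustrative).
lexical = rng.normal(size=(n_texts, 4))    # e.g. word frequency, word length
syntactic = rng.normal(size=(n_texts, 3))  # e.g. parse depth, clause count
shallow = rng.normal(size=(n_texts, 2))    # e.g. sentence length, syllables
y = rng.normal(size=n_texts)               # readability scores (placeholder)

groups = {"lexical": lexical, "syntactic": syntactic, "shallow": shallow}
groups["combined"] = np.hstack([lexical, syntactic, shallow])

for name, X in groups.items():
    # Cross-validated mean squared error; lower means better assessment.
    mse = -cross_val_score(LinearRegression(), X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    print(f"{name:>9}: cross-validated MSE = {mse:.3f}")
```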
A multivariate model for classifying texts' readability
We report results from using the multivariate readability model SVIT to classify texts into various levels. We investigate how the language features integrated in the SVIT model can be transformed into values on known criteria such as vocabulary, grammatical fluency, and propositional knowledge. Such text criteria, sensitive to content, readability, and genre, combined with the profile of a student's reading ability, form the basis for individually adapted texts. The procedure of levelling texts into different stages of complexity is presented along with results from the first cycle of tests conducted on 8th-grade students. The results show that SVIT can be used to classify texts into different complexity levels.
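The transformation from model features to interpretable criteria could, under one simple assumption, be a linear projection; the weight matrix and feature names below are invented for illustration and are not SVIT's actual parameters.

```python
# Illustrative only: projecting raw text features onto interpretable
# criteria (vocabulary, grammatical fluency, propositional knowledge).
import numpy as np

feature_names = ["word_freq", "word_len", "parse_depth", "idea_density"]
criteria = ["vocabulary", "fluency", "propositional"]

# Rows: criteria; columns: features (hypothetical weights).
W = np.array([
    [0.7, 0.3, 0.0, 0.0],   # vocabulary driven by lexical features
    [0.0, 0.1, 0.9, 0.0],   # fluency driven by syntactic features
    [0.0, 0.0, 0.2, 0.8],   # propositional knowledge by idea density
])

x = np.array([0.5, -0.2, 1.1, 0.4])   # one text's standardized features
for criterion, score in zip(criteria, W @ x):
    print(f"{criterion}: {score:+.2f}")
```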
Linguistic Features for Readability Assessment
2020
Readability assessment aims to automatically classify text by the level appropriate for learning readers. Traditional approaches to this task utilize a variety of linguistically motivated features paired with simple machine learning models. More recent methods have improved performance by discarding these features and utilizing deep learning models. However, it is unknown whether augmenting deep learning models with linguistically motivated features would improve performance further. This paper combines these two approaches with the goal of improving overall model performance and addressing this question. Evaluating on two large readability corpora, we find that, given sufficient training data, augmenting deep learning models with linguistically motivated features does not improve state-of-the-art performance. Our results provide preliminary evidence for the hypothesis that the state-of-the-art deep learning models represent linguistic features of the text related to readability. Future research on the nature of representations formed in these models can shed light on the learned features and their relations to linguistically motivated ones hypothesized in traditional approaches.
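A minimal sketch of the augmentation being tested, assuming a generic BERT encoder and two toy handcrafted features (the model choice and feature set are placeholders, not the paper's exact configuration):

```python
# Sketch: concatenating a deep representation with handcrafted linguistic
# features before a classifier head, as the paper evaluates.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled BERT embedding for one text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # (768,)

def handcrafted(text: str) -> torch.Tensor:
    """Two toy linguistic features: word count and mean word length."""
    words = text.split()
    return torch.tensor([float(len(words)),
                         sum(len(w) for w in words) / max(len(words), 1)])

text = "The cat sat on the mat."
features = torch.cat([embed(text), handcrafted(text)])  # (770,) classifier input
print(features.shape)
```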
A comparison of features for automatic readability assessment
Proceedings of the 23rd International Conference on Computational Linguistics Posters, 2010
Several sets of explanatory variables, including shallow, language modeling, POS, syntactic, and discourse features, are compared and evaluated in terms of their impact on predicting the grade level of reading material for primary school students. We find that features based on in-domain language models have the highest predictive power. Entity density (a discourse feature) and POS features, in particular nouns, are individually very useful but highly correlated. Average sentence length (a shallow feature) is more useful, and less expensive to compute, than individual syntactic features. A judicious combination of the features examined here results in a significant improvement over the state of the art.
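The kind of analysis behind these findings can be sketched as correlating each feature with grade level and checking inter-feature correlation; the data below is synthetic and the feature names are illustrative.

```python
# Sketch of the comparison the abstract describes, on synthetic data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
grade = rng.integers(1, 7, size=n).astype(float)

df = pd.DataFrame({
    "avg_sentence_length": grade * 2.0 + rng.normal(0, 1.5, n),  # shallow
    "entity_density": grade * 0.5 + rng.normal(0, 0.8, n),       # discourse
    "noun_ratio": grade * 0.4 + rng.normal(0, 0.9, n),           # POS
    "grade": grade,
})

# Per-feature correlation with grade level gauges predictive power;
# the inter-feature correlation matrix exposes useful-but-redundant
# pairs like the paper's entity density and noun counts.
print(df.corr()["grade"].drop("grade"))
print(df.drop(columns="grade").corr())
```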
Assessing English language sentences readability using machine learning models
PeerJ Computer Science, 2022
Readability has been an active field of research since the late nineteenth century and is vigorously pursued to date. The recent boom in data-driven machine learning has created a viable path forward for readability classification and ranking. The evaluation of text readability is a time-honoured issue with even more relevance in today’s information-rich world. This paper addresses the task of readability assessment for the English language. Given input sentences, the objective is to predict their level of readability, which corresponds to the level of literacy anticipated from the target readers. This readability aspect plays a crucial role in the drafting and comprehension processes of English language learning. Selecting and presenting a suitable collection of sentences for English Language Learners may play a vital role in enhancing their learning curve. In this research, we have used 30,000 English sentences for experimentation. Additionally, they have been annotated into seven different r...
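Sentence-level readability classification of this kind can be sketched as a standard multi-class text pipeline; the sentences and level labels below are invented stand-ins for the paper's 30,000 annotated sentences, and the feature choice is not the authors'.

```python
# Hedged sketch: classifying sentences into readability levels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "The cat sat.",
    "Dogs run fast.",
    "Photosynthesis converts light energy into chemical energy.",
    "Epistemological commitments shape methodological choices.",
]
levels = [1, 1, 4, 7]  # readability level per sentence (illustrative)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(sentences, levels)
print(clf.predict(["Light reflects off surfaces."]))
```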
Reading Level Identification Using Natural Language Processing Techniques
2021
This paper investigates using the Bidirectional Encoder Representations from Transformers (BERT) algorithm and lexical-syntactic features to measure readability. Readability is important in many disciplines, for functions such as selecting passages for school children, assessing the complexity of publications, and writing documentation. Text at an appropriate reading level will help make communication clear and effective. Readability is primarily measured using well-established statistical methods. Recent advances in Natural Language Processing (NLP) have had mixed success incorporating higher-level text features in a way that consistently beats established metrics. This paper contributes a readability method using a modern transformer technique and compares the results to established metrics. This paper finds that the combination of BERT and readability metrics provides a significant improvement in estimation of readability as defined by Crossley et al. [1]. The BERT+Readability mod...
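The "established metrics" half of such a combination can be made concrete with the Flesch Reading Ease formula, FRE = 206.835 − 1.015·(words/sentences) − 84.6·(syllables/words); the syllable counter below is a rough heuristic, and the fusion with BERT features is omitted.

```python
# Sketch of a classic readability metric computed from scratch.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: runs of vowels approximate syllables.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / sentences
            - 84.6 * syllables / len(words))

print(flesch_reading_ease("The cat sat on the mat. It was warm."))
# In a BERT+metrics setup, scores like this one would be combined with
# transformer-derived features; that step is not shown here.
```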
Assessing Vietnamese Text Readability using Multi-Level Linguistic Features
International Journal of Advanced Computer Science and Applications, 2020
Text readability is the problem of determining whether a text is suitable for a certain group of readers, and thus building a model to assess the readability of text carries great significance across the disciplines of science, publishing, and education. While text readability has attracted attention since the late nineteenth century for English and other popular languages, it remains relatively underexplored in Vietnamese. Previous studies on this topic in Vietnamese have focused only on the examination of shallow word-level features using surface statistics such as frequency and ratio. Hence, features at higher levels like sentence structure and meaning are still untapped. In this study, we propose the most comprehensive analysis of Vietnamese text readability to date, targeting features at all linguistic levels, ranging from the lexical and phrasal elements to syntactic and semantic factors. This work pioneers the investigation on the effects of multi-level linguistic features on text readability in the Vietnamese language.
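As a toy illustration of multi-level feature extraction (not the study's feature set): real Vietnamese processing needs a word segmenter and a parser, which this whitespace-based sketch ignores.

```python
# Illustrative multi-level features; syntactic and semantic levels would
# come from a parser and a lexicon, omitted in this sketch.
def multilevel_features(text: str) -> dict:
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    return {
        # Lexical level: surface statistics.
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        # Sentence level: shallow structure.
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }

print(multilevel_features("Con mèo ngồi trên thảm. Trời hôm nay đẹp."))
```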
Towards an improved methodology for automated readability prediction
2010
Readability formulas are often employed to automatically predict the readability of an unseen text. In this article, the formulas and the text characteristics they are composed of are evaluated in the context of large corpora. We describe the behaviour of the formulas and the text characteristics by means of correlation matrices, principal component analysis and a collinearity test. We show methodological shortcomings of some of the existing readability formulas.
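These three diagnostics can be sketched on synthetic data, with collinearity checked via variance inflation factors (VIF) computed from per-feature regressions; the features and correlations below are fabricated for illustration.

```python
# Sketch: correlation matrix, PCA, and a collinearity (VIF) check.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500
sent_len = rng.normal(15, 4, n)
word_len = 0.2 * sent_len + rng.normal(5, 1, n)    # deliberately correlated
syllables = 0.9 * word_len + rng.normal(0, 0.3, n)
X = np.column_stack([sent_len, word_len, syllables])

print(np.corrcoef(X, rowvar=False).round(2))        # correlation matrix

pca = PCA().fit(X)
print(pca.explained_variance_ratio_.round(2))       # shared variance

# Variance inflation factor: regress each feature on the others;
# VIF_i = 1 / (1 - R^2_i), with values above ~10 flagging collinearity.
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    print(f"VIF feature {i}: {1 / (1 - r2):.1f}")
```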
2015
This study is about the development of a learner-focused text readability indexing tool for second language learners (L2) of English. Student essays are used to calibrate the system, making it capable of providing a realistic approximation of L2s’ actual reading ability spectrum. The system aims to promote self-directed (i.e. self-study) language learning and help even those L2s who cannot afford formal education. In this paper, we provide a comparative review of two vectorial semantics-based algorithms, namely, Latent Semantic Indexing (LSI) and Concept Indexing (CI) for text content analysis. Since these algorithms rely on the bag-of-words approach and inherently lack grammar-related analysis, we augment them by incorporating Part-of-Speech (POS) n-gram features to approximate the syntactic complexity of the text documents. Based on the results, CI-based features outperformed LSI-based features in most of the experiments. Without the integration of POS n-gram features, the difference be...
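A hedged sketch of the LSI side plus POS n-gram augmentation the paper describes: LSI is a truncated SVD over a TF-IDF term-document matrix, and the POS tags here come from a toy lexicon so the example stays self-contained (real work would use a tagger such as NLTK's or spaCy's).

```python
# Sketch: LSI vectors augmented with POS-bigram counts (toy data/tagger).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "students read graded essays",
        "essays are graded by readers"]

# LSI: truncated SVD over the TF-IDF term-document matrix.
tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Toy POS tagging via a tiny lexicon, then tag-bigram counts.
lexicon = {"cat": "NN", "mat": "NN", "students": "NN", "essays": "NN",
           "readers": "NN", "sat": "VB", "read": "VB", "graded": "VB",
           "are": "VB", "the": "DT", "on": "IN", "by": "IN"}
pos_docs = [" ".join(lexicon.get(w, "XX") for w in d.split()) for d in docs]
pos_ngrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(pos_docs)

# Augmented representation: semantic (LSI) plus syntactic (POS bigram).
X = np.hstack([lsi, pos_ngrams.toarray()])
print(X.shape)
```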