Ngan N T Nguyen | Vrije Universiteit Amsterdam

Papers by Ngan N T Nguyen

Identifying and Categorizing Offensive Language in Social Media

Offensive language detection has recently become a very important natural language processing task. However, this task poses a challenge for automatic processing due to the complexity of natural language constructs, especially in the language used on social media. In this paper, we describe an automatic offensive language detection method that extracts features at different conceptual levels and applies different machine learning and deep learning architectures.

Developing a new annotation schema for stance detection task

November 2018

This paper reports on a project to develop a multi-dimensional, utterance-level stance annotation schema for stance detection in the vaccination corpus. Understanding the stances in this corpus will shed light on the vaccination debate. The results show a moderate level of agreement between two annotators, which results from the high complexity of a multi-dimensional stance model.

Men speak Martian and women speak Venusian: A study in gender and language

This study compares the informativeness and usefulness of three types of features (stylometric features, word embedding features, and document embedding features) in the task of gender classification using different machine learning models. The best result, 63% accuracy, is achieved using word embedding features and a Stochastic Gradient Descent classifier, suggesting the power of word embeddings in describing the different language of men and women.

LEXICAL DENSITY AND READABILITY OF NON-ENGLISH MAJORED FRESHMEN'S WRITING IN VIETNAMESE CONTEXT

The principal objective of this investigation is to evaluate the lexical density and readability of the writing of first-year Mathematics-majored students with dual majors in English at a pedagogical university in Hanoi. The data were collected from 26 written products and analyzed using the methods for calculating lexical density and readability proposed by Ure (1971) and Flesch (1994), respectively, with the aid of some online text analyzers. The study shows that students achieve only an average level of both lexical density and readability, which suggests that they need to enhance their writing skills with more complex grammar and vocabulary.

Comparison of different Digital Humanities tools: Data Extraction with Old Bailey Website Statistics, Amcat and KnowledgeStore

This paper presents the results of a data-driven project examining the application of three different digital tools (AmCat, Old Bailey Statistics, and KnowledgeStore) in exploring the complexity of the Old Bailey dataset.

Developing a Machine Learning Event Factuality Classifier using the FactBank Corpus

We present an Event Factuality machine learning classification pipeline, which trains and tests on the FactBank corpus. We detail the preprocessing and feature extraction steps and report on our implementation of an XGBoost classifier and an SVM classifier, with the former scoring just over 78% accuracy.

Thesis Chapters by Ngan N T Nguyen

Clickbait anatomy: Identifying clickbait with machine learning

This research explores linguistic patterns in clickbait, aiming to characterize how clickbait differs from serious formal news using quantitative and qualitative analyses. Two significant findings about the nature of clickbait are reported: (1) there are noticeable changes in syntactic structures and topics in clickbait headlines, and (2) the contents of a clickbait article can provide valuable discourse-level information that can be used to differentiate clickbait from non-clickbait. Based on the results of the analysis, three types of features are selected for use in machine learning systems: stylometric features with encoded sequential part-of-speech and dependency tags, word embeddings, and document embeddings. The best system, which uses the Support Vector Machines algorithm and word embedding features, achieves precision and recall scores of 0.82 and an accuracy of 82%.

A CORPUS-ASSISTED DISCOURSE ANALYSIS OF ADJECTIVAL COLLOCATION OF THE WORD "EDUCATION" IN THE CORPUS OF GLOBAL WEB-BASED ENGLISH

This graduation thesis describes a corpus-based study of adjectival collocates of the word “education” in the American English portion of the Corpus of Global Web-based English. Analyzing real-life usage of the word “education” suggests the public concerns and ideologies of Americans around the topic of education. This study is expected to fill a gap, as almost no corpus-linguistic research on the lexical item “education” has been done. A combination of corpus-linguistic and discourse-analytical methods has been applied to examine not only language patterns but also socio-political ideologies around the topic. Four significant conclusions are drawn: (1) a large number of adjectival collocates of the word “education” are identified and classified into four categories representing four aspects of education: level, quality, forms, and types of education; (2) in combination with the first three categories, “education” carries the meaning of the act and process of teaching and learning, while with the last category it means a particular kind of teaching or training; (3) higher education is the topic of greatest concern to the American public; (4) the five most significant ideologies discovered in the corpus are: higher education is associated with financial affairs, higher education is an industry, the government's monetary policy on higher education, people require greater access to higher education, and people value higher education. The study contributes to the field of deriving word meanings through corpus analysis and to the field of discourse analysis.
