Joseph Marvin Imperial | National University (Philippines)
Papers by Joseph Marvin Imperial
2021 International Conference on Asian Language Processing (IALP)
In this study, we pioneer the development of an audio-based hate speech classifier from online, short-form TikTok videos using traditional machine learning algorithms such as Logistic Regression, Random Forest, and Support Vector Machines. We scraped over 4746 videos using the TikTok API tool and extracted audio-based features such as MFCCs, Spectral Centroid, Rolloff, Bandwidth, Zero-Crossing Rate, and Chroma values as primary feature sets. Results show that using the extracted predictors for hate speech detection can obtain up to 78.5% accuracy on an optimized Random Forest model, crossing the 50% benchmark for models in this task. In addition, comparing the Information Gain scores and globally learned model weights identified that Spectral Rolloff and MFCCs are top predictors in discriminating hate speech for the Filipino language.
2021 5th International Conference on E-Society, E-Education and E-Technology, 2021
The impact of hate speech is not only detrimental to an individual's human rights but also a grave threat to social stability and democracy. Through social media, the spread of hate speech has alarmingly increased across the globe. Various social media platforms aim to eliminate hateful content, and this challenge poses the need for automatic and accurate hate speech detection. Presently, known techniques in this research primarily make use of either text or audio features. However, the use of facial expressions in hate speech detection remains underexplored. Thus, for this study, the use of facial expressions to understand hate speech has been thoroughly investigated. The dataset used is image data generated from Filipino TikTok videos with a frame size of 1080 x 1920 pixels and divided into 5 frames per second. Two approaches, namely conventional and deep learning-based frameworks, have been implemented in building the Facial Expression Recognition (FER) model to und...
2021 5th International Conference on E-Society, E-Education and E-Technology, 2021
With the rise of human-centric technologies such as social media platforms, the amount of hate also continues to grow proportionally with the increasing number of users worldwide. TikTok is one of the most-used social media platforms due to its feature that allows users to express themselves via creating and sharing short-form videos based on any desired topic and content. In addition, it has also become a platform for political discourse and mudslinging as users can freely express an opinion and indirectly debate with random people online. In this study, we propose the use of BERT, a complex bidirectional transformer-based model, for the task of automatic hate speech detection from speech transcribed from Tagalog TikTok videos. Results of our experiments show that a BERT-based hate speech classifier scores 61% F1. We also extended the task beyond several algorithms such as LSTM, Naïve Bayes, and Decision Tree and found out that traditional methods such as a simple Bernoulli Naïve B...
In this paper, we present a unified model that works for both multilingual and crosslingual prediction of reading times of words in various languages. The secret behind the success of this model is in the preprocessing step where all words are transformed to their universal language representation via the International Phonetic Alphabet (IPA). To the best of our knowledge, this is the first study to favorably exploit this phonological property of language for the two tasks. Various feature types were extracted covering basic frequencies, n-grams, information-theoretic, and psycholinguistically-motivated predictors for model training. A finetuned Random Forest model obtained the best performance for both tasks with 3.8031 and 3.9065 MAE scores for mean first fixation duration (FFDAve) and mean total reading time (TRTAve), respectively.
ArXiv, 2021
Readability assessment is the process of identifying the level of ease or difficulty of a certain piece of text for its intended audience. Approaches have evolved from the use of arithmetic formulas to more complex pattern-recognizing models trained using machine learning algorithms. While these approaches provide competitive results, limited work has been done on analyzing how linguistic variables affect model inference quantitatively. In this work, we dissect machine learning-based readability assessment models in Filipino by performing global and local model interpretation to understand the contributions of varying linguistic features and discuss their implications in the context of the Filipino language. Results show that using a model trained with top features from global interpretation obtained higher performance than the ones using features selected by Spearman correlation. Likewise, we also empirically observed local feature weight boundaries for discriminating reading difficu...
2019 IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), 2019
Motif search is a common problem in bioinformatics where unique DNA sequences (motifs) of a specific length inscribed in long strands signify binding sites for transcription factors. In this paper, we present some important notes on the implementation of motif search using the Gibbs sampling algorithm in a distributed computing environment by analyzing visualizations of the speed and motif scoring of various distributed implementations. For the DNA sequence data, we used open-source mouse genome fragments with lengths 250, 500, and 1000. We built upon previous studies (Perera and Ragel, 2013; Chen and Jiang, 2006) by distributing the motif search workloads (jobs) across 16 CPU cores contained on 2 computer nodes instead of the traditional way of parallelizing on a single computing device with multicore CPUs. Results show that saving the DNA sequences in a list and passing it as a parameter argument obtained the fastest execution time compared to implementa...
ArXiv, 2021
In order to ensure quality and effective learning, fluency, and comprehension, the proper identification of the difficulty levels of reading materials should be observed. In this paper, we describe the development of automatic machine learning-based readability assessment models for educational Filipino texts using the most diverse set of linguistic features for the language. Results show that a Random Forest model obtained a high performance of 62.7% in terms of accuracy, and 66.1% when using the optimal combination of feature sets consisting of traditional and syllable pattern-based predictors.
2019 International Conference on Asian Language Processing (IALP), 2019
In this paper, we present an experimental development of a spell checker for the Tagalog language using a word list with 300 random root words and three inflected forms as training data and a two-layered architecture that combines a Deterministic Finite Automaton (DFA) with the Levenshtein edit-distance. A DFA is used to process strings to identify if they belong to a certain language via the binary result of accept or reject. The Levenshtein edit-distance of two strings is the number (k) of deletions, alterations, and insertions needed to turn one sequence of characters into the other. From the sample trained wordlist, results show that a value of 1 for the edit-distance (k) can be effective in spell-checking Tagalog sentences. Any value greater than 1 can cause words to be suggested even if the spelling is correct, due to the selective and prominent usage of certain characters in the Tagalog language like a, n, g, t, s, l.
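The edit-distance step described in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `levenshtein` and `suggest` are hypothetical, and a plain set-membership test stands in for the DFA's accept/reject decision as a simplifying assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of deletions, substitutions, and insertions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(word: str, wordlist: set, k: int = 1) -> list:
    """Suggest corrections within edit distance k.

    Assumption: membership in `wordlist` replaces the DFA acceptance
    check; an accepted word yields no suggestions.
    """
    if word in wordlist:
        return []
    return sorted(w for w in wordlist if levenshtein(word, w) <= k)
```

With a toy word list such as `{"bahay", "buhay", "bata"}`, the misspelling `"bahey"` yields only `"bahay"` at k = 1. Raising k to 2 would also return `"buhay"`, which is consistent with the abstract's remark that larger thresholds produce more suggestions.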
ArXiv, 2021
Assessing the proper difficulty levels of reading materials or texts in general is the first step towards effective comprehension and learning. In this study, we improve the conventional methodology of automatic readability assessment by incorporating the Word Mover’s Distance (WMD) of ranked texts as an additional post-processing technique to further ground the difficulty level given by a model. Results of our experiments on three multilingual datasets in Filipino, German, and English show that the post-processing technique outperforms previous vanilla and ranking-based models using SVM.
2020 International Conference on Asian Language Processing (IALP), 2020
Proper identification of the difficulty level of materials prescribed as required readings in an educational setting is key towards effective learning in children. Educators and publishers have relied on readability formulas in predicting text readability. While these formulas abound in the English language, limited work has been done on automatic readability assessment for the Filipino language. In this study, we build upon the previous works using traditional (TRAD) and lexical (LEX) linguistic features by incorporating language model (LM) features for possible improvement in identifying readability levels of Filipino storybooks. Results showed that combining LM predictors with TRAD and LEX, forming a hybrid feature set, increased the performances of readability models trained using Logistic Regression and Support Vector Machines by up to approximately 25% to 32%. From the results of performing feature selection using Spearman correlation and Information Gain on the feature set, we found...
2021 IEEE Global Humanitarian Technology Conference (GHTC)
Readability formulas consider word familiarity as one of the factors for predicting the readability of children's books. Word familiarity is dependent on the frequency with which the words are encountered in daily reading. Developing effective recognition of these high-frequency words, often referred to as "sight words", can help young readers develop their reading fluency and comprehension. In this paper, we describe our work in building a dictionary of sight words for Filipino with the use of a corpus of Filipino literary materials written for children. We expanded the dictionary to a total of 664 words with the use of a pre-trained word embedding model. The availability of such a dictionary can facilitate the development of a readability formula for Filipino text, especially in the context of its lexical complexity.
In this paper, we describe our efforts in establishing a simple knowledge base by building a semantic network composed of concepts and word relationships in the context of disasters in the Philippines. Our primary source of data is a collection of news articles scraped from various Philippine news websites. Using word embeddings, we extract semantically similar and co-occurring words from an initial seed word list. We arrive at an expanded ontology with a total of 450 word assertions. We let experts from the fields of linguistics, disasters, and weather science evaluate our knowledge base and arrived at an agreeability rate of 64%. We then perform a time-based analysis of the assertions to identify important semantic changes captured by the knowledge base such as the (a) trend of roles played by human entities, (b) memberships of human entities, and (c) common association of disaster-related words. The context-specific knowledge base developed from this study can be adapted by inte...
Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. For researchers, one of the many open problems in the field is to make such models trained for the task show efficacy even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models through a joint-learning method combined with handcrafted linguistic features for readability assessment. Results show that the proposed method outperforms classical approaches in readability assessment using English and Filipino datasets, obtaining as high as a 12.4% increase in F1 performance. We also show that the knowledge encoded in BERT embeddings can be used as a substitute feature set for low-resource languages like Filipino with limited semantic and syntactic NLP tools to explicitly extract feature values for the task.
Proper identification of grade levels of children’s reading materials is an important step towards effective learning. Recent studies in readability assessment for the English domain applied modern approaches in natural language processing (NLP) such as machine learning (ML) techniques to automate the process. There is also a need to extract the correct linguistic features when modeling readability formulas. In the context of the Filipino language, limited work has been done [1, 2], especially in considering the language’s lexical complexity as main features. In this paper, we explore the use of lexical features towards improving the development of readability identification of children’s books written in Filipino. Results show that combining lexical features (LEX) consisting of type-token ratio, lexical density, lexical variation, and foreign word count with traditional features (TRAD) used by previous works such as sentence length, average syllable length, polysyllabic words, word, se...
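As a rough illustration of two of the feature types named in the abstract above, the following Python sketch computes a type-token ratio (a LEX-style feature) and an average sentence length (a TRAD-style feature). The function names, the regular-expression tokenizer, and the punctuation-based sentence splitter are assumptions made for this example; the paper's actual feature extraction is not described in this listing.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique word forms divided by total words (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def average_sentence_length(text: str) -> float:
    """Mean number of words per sentence, splitting on ., !, and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    word_counts = [len(re.findall(r"\w+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)
```

For the phrase "ang bata ay bata", three distinct forms over four tokens give a type-token ratio of 0.75. Features like these would typically be concatenated into one vector per book before training a classifier.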
Handwriting is a skill used to express thoughts, ideas, and language. Over the years, medical doctors have been well known for having illegible cursive handwriting, and this has become a generally accepted matter. The datasets used in this paper are samples of doctors' cursive handwriting collected from several clinics and hospitals in Metro Manila, Quezon City, and Taytay, Rizal. In this paper, we present a Handwriting Recognition System using a Deep Convolutional Recurrent Neural Network (CRNN), developed to identify the text in images of prescriptions written by doctors and to show a readable text conversion of the cursive handwriting. In this study, two models were evaluated, and based on the experimentation, the CRNN with a model-based normalization scheme performed better than the CRNN alone. This study achieved a 76% training accuracy rate, and the developed model was successfully implemented in a mobile application, having achieved a validation accuracy of 72% for the validation set from the rema...
The Philippines is a common ground for natural calamities like typhoons, floods, volcanic eruptions, and earthquakes. With Twitter as one of the most used social media platforms in the Philippines, a total of 39,867 preprocessed tweets were obtained given a time frame starting from November 1, 2013 to January 31, 2014. Sentiment analysis determines the underlying emotion given a series of words. The main purpose of this study is to identify the sentiments expressed in the tweets sent by the Filipino people before, during, and after Typhoon Yolanda using two variations of Recurrent Neural Networks: standard and bidirectional. The best generated models after training with various hyperparameters achieved a high accuracy of 81.79% for fine-grained classification using a standard RNN and 87.69% for binary classification using a bidirectional RNN. Findings revealed that 51.1% of the tweets sent were positive, expressing support, love, and words of courage to the victims; 19.8% were negative stat...
Reading is an essential part of children’s learning. Identifying the proper readability level of reading materials will ensure effective comprehension. We present our efforts to develop a baseline model for automatically identifying the readability of children’s and young adult’s books written in Filipino using machine learning algorithms. For this study, we processed 258 picture books published by Adarna House Inc. In contrast to old readability formulas relying on static attributes like number of words, sentences, syllables, etc., other textual features were explored. Count vectors, Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, and character-level n-grams were extracted to train models using three major machine learning algorithms: Multinomial Naïve Bayes, Random Forest, and K-Nearest Neighbors. A combination of K-Nearest Neighbors and Random Forest via a voting-based classification mechanism resulted in the best performing model with a high average training accuracy ...
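To make the character-level n-gram features mentioned in the abstract above concrete, here is a minimal Python sketch that counts overlapping character n-grams with the standard library. The function name `char_ngrams` and the default of n = 3 are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in lowercased text.

    Returns an empty Counter when the text is shorter than n.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

For example, `char_ngrams("banana", 2)` counts `"an"` and `"na"` twice each and `"ba"` once. Counts like these can serve as a sparse feature vector for classifiers such as those listed in the abstract.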
One of the most important humanitarian responsibilities of every individual is to protect the future of our children. This entails not only protection of physical welfare but also protection from ill events that can potentially affect the mental well-being of a child, such as sexual coercion and abuse, which, in worst-case scenarios, can result in lifelong trauma. In this study, we perform a preliminary investigation of how child sex peddlers spread illegal pornographic content and target minors for sexual activities on Twitter in the Philippines using Natural Language Processing techniques. Results of our studies show frequently used and co-occurring words that traffickers use to spread content, as well as four main roles played by these entities that contribute to the proliferation of child pornography in the country.
2019 IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), 2019
Social media holds a substantial amount of text data that can help organizations better understand their clients. For students of National University (NU) – Manila, Facebook serves as a medium to express their opinions and create topics for discussion that may generally speak about the University. Through Topic Modeling using Latent Dirichlet Allocation (LDA), various experiments were conducted to identify the topics discussed by the students based on the highest coherence score value obtained. From these experiments, a total of twenty (20) topics with Alpha and Beta values set to one (1) revealed the highest coherence. The topics were labeled and revealed interesting insights. Personal relationships and school-related concerns were the common topics posted on the two Facebook pages. To further improve the study, a chronological approach for topic modeling is recommended.
2021 International Conference on Asian Language Processing (IALP)
In this study, we pioneer the development of an audio-based hate speech classifier from online, s... more In this study, we pioneer the development of an audio-based hate speech classifier from online, short-form TikTok videos using traditional machine learning algorithms such as Logistic Regression, Random Forest, and Support Vector Machines. We scraped over 4746 videos using the TikTok API tool and extracted audio-based features such as MFCCs, Spectral Centroid, Rolloff, Bandwidth, Zero-Crossing Rate, and Chroma values as primary feature sets. Results show that using the extracted predictors for hate speech detection can obtain up to 78.5% accuracy on an optimized Random Forest model, crossing the 50% benchmark for models in this task. In addition, comparing the Information Gain scores and globally learned model weights identified that Spectral Rolloff and MFCCs are top predictors in discriminating hate speech for the Filipino language.
2021 5th International Conference on E-Society, E-Education and E-Technology, 2021
The impact of hate speech is not only detrimental to an individual's human rights; but also, ... more The impact of hate speech is not only detrimental to an individual's human rights; but also, a grave threat to social stability and democracy. Through social media, the spread of hate speech has alarmingly increased across the globe. Various social media platform's goal is to eliminate hateful content and this challenge poses the need for automatic and accurate hate speech detection. Presently, known techniques in this research primarily made use of either text or audio features. However, the use of the facial expression in hate speech detection is not that explored. Thus, for this study, the use of facial expressions to understand hate speech has been thoroughly investigated. The dataset used is image data generated from Filipino Tiktok videos with a frame size of 1080 x 1920 pixels and divided into 5 frames per second. Two approaches namely conventional and deep learning-based frameworks have been implemented in building the Facial Expression Recognition (FER) model to und...
2021 5th International Conference on E-Society, E-Education and E-Technology, 2021
With the rise of human-centric technologies such as social media platforms, the amount of hate al... more With the rise of human-centric technologies such as social media platforms, the amount of hate also continues to grow proportionally with the increasing number of users worldwide. TikTok is one of the most-used social media platforms due to its feature that allows users to express themselves via creating and sharing short-form videos based on any desired topic and content. In addition, it has also become a platform for political discourse and mudslinging as users can freely express an opinion and indirectly debate with random people online. In this study, we propose the use of BERT, a complex bidirectional transformer-based model, for the task of automatic hate speech detection from speech transcribed from Tagalog TikTok videos. Results of our experiments show that a BERT-based hate speech classifier scores 61% F1. We also extended the task beyond several algorithms such as LSTM, Naïve Bayes, and Decision Tree and found out that traditional methods such as a simple Bernoulli Naïve B...
In this paper, we present a unified model that works for both multilingual and crosslingual predi... more In this paper, we present a unified model that works for both multilingual and crosslingual prediction of reading times of words in various languages. The secret behind the success of this model is in the preprocessing step where all words are transformed to their universal language representation via the International Phonetic Alphabet (IPA). To the best of our knowledge, this is the first study to favorable exploit this phonological property of language for the two tasks. Various feature types were extracted covering basic frequencies, n-grams, information theoretic, and psycholinguistically-motivated predictors for model training. A finetuned Random Forest model obtained best performance for both tasks with 3.8031 and 3.9065 MAE scores for mean first fixation duration (FFDAve) and mean total reading time (TRTAve) respectively1.
ArXiv, 2021
Readability assessment is the process of identifying the level of ease or difficulty of a certain... more Readability assessment is the process of identifying the level of ease or difficulty of a certain piece of text for its intended audience. Approaches have evolved from the use of arithmetic formulas to more complex pattern-recognizing models trained using machine learning algorithms. While using these approaches provide competitive results, limited work is done on analyzing how linguistic variables affect model inference quantitatively. In this work, we dissect machine learning-based readability assessment models in Filipino by performing global and local model interpretation to understand the contributions of varying linguistic features and discuss its implications in the context of the Filipino language. Results show that using a model trained with top features from global interpretation obtained higher performance than the ones using features selected by Spearman correlation. Likewise, we also empirically observed local feature weight boundaries for discriminating reading difficu...
2019 IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management ( HNICEM ), 2019
Motif search is a common problem in bioinformatics where unique DNA sequences (motifs) of a speci... more Motif search is a common problem in bioinformatics where unique DNA sequences (motifs) of a specific length inscribed in long strands signify binding sites for transcription factors. In this paper, we present some important notes on the implementation of motif search using Gibbs sampling algorithm in a distributed computing environment by analyzing visualization on speed and motif scoring of various distributed implementations. For the DNA sequences data, we used an open-source mouse genome fragments with lengths 250, 500, and 1000. We built upon our previous studies (Perera and Ragel, 2013; Chen and Jiang, 2006) by integrating a distributed environment of the motif search workloads (jobs) across 16 CPU cores contained on 2 computer nodes instead of the traditional way of parallelizing on a single computing device with multicore CPUs. Results show that using saving the DNA sequences in list and adding as a parameter argument obtained the fastest execution time compared to implementa...
ArXiv, 2021
In order to ensure quality and effective learning, fluency, and comprehension, the proper identif... more In order to ensure quality and effective learning, fluency, and comprehension, the proper identification of the difficulty levels of reading materials should be observed. In this paper, we describe the development of automatic machine learning-based readability assessment models for educational Filipino texts using the most diverse set of linguistic features for the language. Results show that using a Random Forest model obtained a high performance of 62.7% in terms of accuracy, and 66.1% when using the optimal combination of feature sets consisting of traditional and syllable pattern-based predictors.
2019 International Conference on Asian Language Processing (IALP), 2019
In this paper, we present an experimental development of a spell checker for the Tagalog language... more In this paper, we present an experimental development of a spell checker for the Tagalog language using a set of word list with 300 random root words and three inflected forms as training data and a two-layered architecture of combined Deterministic Finite Automaton (DFA) with Levenshtein edit-distance. A DFA is used to process strings to identify if it belongs to a certain language via the binary result of accept or reject. The Levenshtein edit-distance of two strings is the number (k) of deletions, alterations, insertions between two sequences of characters. From the sample trained wordlist, results show that a value of 1 for the edit-distance (k) can be effective in spelling Tagalog sentences. Any value greater than 1 can cause suggestion of words even if the spelling of words is correct due to selective and prominent usage of certain characters in the Tagalog language like a, n, g, t, s, l.
ArXiv, 2021
Assessing the proper difficulty levels of reading materials or texts in general is the first step... more Assessing the proper difficulty levels of reading materials or texts in general is the first step towards effective comprehension and learning. In this study, we improve the conventional methodology of automatic readability assessment by incorporating the Word Mover’s Distance (WMD) of ranked texts as an additional post-processing technique to further ground the difficulty level given by a model. Results of our experiments on three multilingual datasets in Filipino, German, and English show that the post-processing technique outperforms previous vanilla and ranking-based models using SVM1
2020 International Conference on Asian Language Processing (IALP), 2020
Proper identification of the difficulty level of materials prescribed as required readings in an ... more Proper identification of the difficulty level of materials prescribed as required readings in an educational setting is key towards effective learning in children. Educators and publishers have relied on readability formulas in predicting text readability. While these formulas abound in the English language, limited work has been done on automatic readability assessment for the Filipino language. In this study, we build upon the previous works using traditional (TRAD) and lexical (LEX) linguistic features by incorporating language model (LM) features for possible improvement in identifying readability levels of Filipino storybooks. Results showed that combining LM predictors to TRAD and LEX, forming a hybrid feature set, increased the performances of readability models trained using Logistic Regression and Support Vector Machines by up to approx\approxapprox 25% – 32%. From the results of performing feature selection using Spearman correlation and Information Gain on the feature set, we found...
2021 IEEE Global Humanitarian Technology Conference (GHTC)
Readability formulas consider word familiarity as one of the factors for predicting the readabili... more Readability formulas consider word familiarity as one of the factors for predicting the readability of children's books. Word familiarity is dependent on the frequency in which the words are encountered in daily reading. Often referred to as "sight words", developing effective recognition of these high-frequency words can assist young readers to develop their reading fluency and comprehension. In this paper, we describe our work in building a dictionary of sight words for Filipino with the use of a corpus of Filipino literary materials written for children. We expanded the dictionary to a total of 664 words with the use of pre-trained word embedding model. The availability of such dictionary can facilitate the development of a readability formula for Filipino text, especially in the context of its lexical complexity.
In this paper, we describe our efforts in establishing a simple knowledge base by building a sema... more In this paper, we describe our efforts in establishing a simple knowledge base by building a semantic network composed of concepts and word relationships in the context of disasters in the Philippines. Our primary source of data is a collection of news articles scraped from various Philippine news websites. Using word embeddings, we extract semantically similar and co-occurring words from an initial seed words list. We arrive at an expanded ontology with a total of 450 word assertions. We let experts from the fields of linguistics, disasters, and weather science evaluate our knowledge base and arrived at an agreeability rate of 64%. We then perform a time-based analysis of the assertions to identify important semantic changes captured by the knowledge base such as the (a) trend of roles played by human entities, (b) memberships of human entities, and (c) common association of disaster-related words. The context-specific knowledge base developed from this study can be adapted by inte...
Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. For researchers, one of the many open problems in the field is making models trained for the task effective even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models through a joint-learning method combined with handcrafted linguistic features for readability assessment. Results show that the proposed method outperforms classical approaches in readability assessment on English and Filipino datasets, obtaining as high as a 12.4% increase in F1 performance. We also show that the knowledge encoded in BERT embeddings can be used as a substitute feature set for low-resource languages like Filipino, which have limited semantic and syntactic NLP tools for explicitly extracting feature values for the task.
Proper identification of grade levels of children’s reading materials is an important step towards effective learning. Recent studies in readability assessment for the English domain applied modern approaches in natural language processing (NLP) such as machine learning (ML) techniques to automate the process. There is also a need to extract the correct linguistic features when modeling readability formulas. In the context of the Filipino language, limited work has been done [1, 2], especially in considering the language’s lexical complexity as main features. In this paper, we explore the use of lexical features towards improving readability identification of children’s books written in Filipino. Results show that combining lexical features (LEX) consisting of type-token ratio, lexical density, lexical variation, and foreign word count with traditional features (TRAD) used by previous works such as sentence length, average syllable length, polysyllabic words, word, se...
Handwriting is a skill used to express thoughts, ideas, and language. Over the years, medical doctors have been well known for illegible cursive handwriting, and this has become generally accepted. The dataset used in this paper consists of samples of doctors' cursive handwriting collected from several clinics and hospitals in Metro Manila, Quezon City, and Taytay, Rizal. In this paper, we present a handwriting recognition system using a Deep Convolutional Recurrent Neural Network (CRNN), developed to identify the text in images of prescriptions written by doctors and to display a readable text conversion of the cursive handwriting. Two models were evaluated, and based on the experiments, the CRNN with a model-based normalization scheme performed better than the CRNN alone. This study achieved a 76% training accuracy rate, and the developed model was successfully implemented in a mobile application, having achieved a validation accuracy of 72% for the validation set from the rema...
The Philippines is frequently hit by natural calamities such as typhoons, floods, volcanic eruptions, and earthquakes. With Twitter being one of the most used social media platforms in the Philippines, a total of 39,867 preprocessed tweets were obtained covering the period from November 1, 2013 to January 31, 2014. Sentiment analysis determines the underlying emotion given a series of words. The main purpose of this study is to identify the sentiments expressed in the tweets sent by the Filipino people before, during, and after Typhoon Yolanda using two variations of Recurrent Neural Networks: standard and bidirectional. The best models generated after training with various hyperparameters achieved an accuracy of 81.79% for fine-grained classification using a standard RNN and 87.69% for binary classification using a bidirectional RNN. Findings revealed that 51.1% of the tweets sent were positive, expressing support, love, and words of courage to the victims; 19.8% were negative stat...
Reading is an essential part of children’s learning. Identifying the proper readability level of reading materials will ensure effective comprehension. We present our efforts to develop a baseline model for automatically identifying the readability of children’s and young adults’ books written in Filipino using machine learning algorithms. For this study, we processed 258 picture books published by Adarna House Inc. In contrast to old readability formulas relying on static attributes like number of words, sentences, syllables, etc., other textual features were explored. Count vectors, Term Frequency-Inverse Document Frequency (TF-IDF), n-grams, and character-level n-grams were extracted to train models using three major machine learning algorithms: Multinomial Naïve Bayes, Random Forest, and K-Nearest Neighbors. A combination of K-Nearest Neighbors and Random Forest via a voting-based classification mechanism resulted in the best-performing model with a high average training accuracy ...
One of the most important humanitarian responsibilities of every individual is to protect the future of our children. This entails not only protecting their physical welfare but also shielding them from harmful events that can affect a child's mental well-being, such as sexual coercion and abuse, which in worst-case scenarios can result in lifelong trauma. In this study, we perform a preliminary investigation of how child sex peddlers spread illegal pornographic content and target minors for sexual activities on Twitter in the Philippines using Natural Language Processing techniques. Results of our study show frequently used and co-occurring words that traffickers use to spread content, as well as four main roles played by these entities that contribute to the proliferation of child pornography in the country.
2019 IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), 2019
Social media holds a substantial amount of text data that can help organizations better understand their clients. For students of National University (NU) – Manila, Facebook serves as a medium to express their opinions and create topics for discussion that may generally speak about the University. Through Topic Modeling using Latent Dirichlet Allocation (LDA), various experiments were conducted to identify the topics discussed by the students based on the highest coherence score obtained. From these experiments, a total of twenty (20) topics with Alpha and Beta values set to one (1) yielded the highest coherence. The topics were labeled and revealed interesting insights. Personal relationships and school-related concerns were the common topics posted on the two Facebook pages. To further improve the study, a chronological approach for topic modeling is recommended.
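Selecting a topic model by its coherence score, as this abstract describes, can be sketched as follows. The abstract does not say which coherence measure was used, so the UMass measure below is an assumption, and the documents and candidate topic word lists are invented toy data standing in for the output of separately trained LDA models.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence of one topic, given its top words in rank order.

    Assumes every top word occurs in at least one document, which holds
    when the topics come from a model trained on the same corpus.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_freq(top_words[i], top_words[j])
            score += math.log((joint + 1) / doc_freq(top_words[j]))
    return score


def mean_coherence(topics, documents):
    """Average coherence over all topics of one candidate model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


# Toy tokenized posts (hypothetical).
docs = [
    ["enrollment", "tuition", "fees"],
    ["enrollment", "tuition", "schedule"],
    ["friends", "love", "advice"],
    ["friends", "love", "tuition"],
]

# Hypothetical top words per topic, keyed by the number of topics.
candidates = {
    1: [["enrollment", "love"]],
    2: [["enrollment", "tuition"], ["friends", "love"]],
}

scores = {k: mean_coherence(topics, docs) for k, topics in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

In this toy setup the two-topic candidate scores higher because its top words actually co-occur in the same posts, which is the property a coherence-based selection rewards.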