Christian E. Maldonado-Sifuentes - Profile on Academia.edu
Drafts by Christian E. Maldonado-Sifuentes
Lecture Notes in Computer Science, 2023
This paper offers a comparative analysis of two state-of-the-art machine translation models for Spanish to Indigenous languages of Colombia and Mexico, with the aim of investigating their effectiveness and limitations under low-resource conditions. Our methodology involved aligning verse pairs from the Bible for twelve Indigenous languages and constructing parallel datasets for evaluation using BLEU and ROUGE metrics. The results demonstrate that transformer-based models can deliver competitive performance in translating from Spanish to Indigenous languages with minimal configuration. In particular, we found that the Opus-based model obtained the best performance in 11 of the languages in the test set, while the Fairseq model performed competitively in scenarios where training data is scarcer. Additionally, we provide a comprehensive analysis of the findings, including insights into the strengths and limitations of the models. Finally, we suggest potential directions for future research in low-resource language translation, specifically in the context of Latin American Indigenous languages.
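The abstract does not specify the scoring tooling; the sketch below shows one way verse-aligned reference/hypothesis pairs could be scored with BLEU and ROUGE, assuming the sacrebleu and rouge_score packages. The two placeholder verses are Spanish Bible text; in the paper the references are in the Indigenous target languages.

```python
# Hedged evaluation sketch (assumption: sacrebleu and rouge_score; the paper
# only states that BLEU and ROUGE were used).
import sacrebleu
from rouge_score import rouge_scorer

# Placeholder verse-aligned data; real references would be Bible verses in the
# Indigenous target language, paired with the system's outputs.
references = ["en el principio creó dios los cielos y la tierra",
              "y la tierra estaba desordenada y vacía"]
hypotheses = ["en el principio dios creó los cielos y la tierra",
              "y la tierra estaba vacía"]

# Corpus-level BLEU over all verses.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# Sentence-level ROUGE-L F1, averaged over verses.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
avg_rouge_l = sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                  for ref, hyp in zip(references, hypotheses)) / len(references)
print(f"Average ROUGE-L F1: {avg_rouge_l:.3f}")
```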
Papers by Christian E. Maldonado-Sifuentes
IJCOPI, 2024
Indigenous languages like Purépecha face significant challenges in the modern era, particularly due to limited digital resources and a dwindling number of speakers. This study, conducted by researchers from CIC-IPN and CONAHCYT, presents a novel application of Transformer-based neural networks for the automatic translation of Purépecha to Spanish. The work includes the creation of a comprehensive bilingual corpus, the implementation of a sophisticated model architecture, and extensive training and evaluation processes. The results indicate directions for leveraging AI to preserve and revitalize indigenous languages.
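The abstract does not detail the preprocessing pipeline; a common first step for a Transformer model on a small bilingual corpus is training a shared subword tokenizer. The sketch below is illustrative only and assumes SentencePiece and hypothetical corpus file names (pur.txt, spa.txt, one aligned sentence per line).

```python
# Illustrative subword-tokenization step for a small parallel corpus
# (assumption: SentencePiece; the study does not name its tokenizer).
import sentencepiece as spm

# Hypothetical files: line i of pur.txt is aligned with line i of spa.txt.
spm.SentencePieceTrainer.train(
    input="pur.txt,spa.txt",   # one shared vocabulary over both languages
    model_prefix="pur_spa",
    vocab_size=8000,           # small vocabulary for a small corpus
    character_coverage=1.0,    # keep all characters; important for under-resourced orthographies
)

sp = spm.SentencePieceProcessor(model_file="pur_spa.model")
print(sp.encode("la lengua purépecha", out_type=str))  # subword pieces for a sample phrase
```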
Leveraging Machine Learning to Unveil the Critical Role of Geographic Factors in COVID-19 Mortality in Mexico
Computación y sistemas, Mar 20, 2024
Comparing Transformer-Based Machine Translation Models for Low-Resource Languages of Colombia and Mexico
Lecture Notes in Computer Science, Nov 8, 2023
In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The findings show that the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) affects translation performance and that indigenous languages work better when used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings. The dataset and scripts are available at https://github.com/atnafuatx/Machine-Translation-Resources.
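As a rough illustration of the fine-tuning setup described above, the sketch below loads an M2M100 checkpoint with Hugging Face Transformers and translates a Spanish sentence. The local checkpoint path and the reuse of an existing language code for the indigenous target are assumptions; M2M100's pretrained vocabulary has no Mazatec or Mixtec codes, so the paper's exact token handling may differ.

```python
# Hedged sketch: translating with a (hypothetically fine-tuned) M2M100 checkpoint.
# "finetuned-m2m100-es-mixtec" is a hypothetical local path, not the authors' release;
# substitute "facebook/m2m100_418M" to run the snippet with the base model.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "finetuned-m2m100-es-mixtec"  # assumption
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "es"  # source side is Spanish
encoded = tokenizer("Buenos días a todos", return_tensors="pt")

# M2M100 has no Mixtec code; forcing "es" here is only a placeholder for whatever
# target-language convention the fine-tuning actually used.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("es"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```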
Polibits, 2024
In the present manuscript, we introduce a novel and holistic architecture for the ProtoAGI system, conceptualized from a systems engineering standpoint. This architecture is elaborately crafted to emulate artificial general intelligence (AGI) through the integration of diverse components and knowledge frameworks, thereby augmenting its performance and adaptability. We meticulously delineate the system's proficiency in processing intricate user inputs, its capacity for adaptive learning from historical datasets, and its ability to generate responses that are contextually relevant. The cornerstone of our proposition is the intricate orchestration of Large Language Models (LLMs), task-specific solvers, and a comprehensive knowledge repository, which collectively propel the system towards achieving genuine adaptability and autonomous learning capabilities. This approach not only signifies a pioneering venture into the realm of AGI system design but also lays the groundwork for subsequent advancements in this field.
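The manuscript describes the architecture at a conceptual level; the following is a purely illustrative sketch of the orchestration pattern it names (an LLM, task-specific solvers, and a knowledge repository). All class names, routing logic, and dummy components are hypothetical and not taken from the paper.

```python
# Illustrative orchestration skeleton; names and routing logic are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class KnowledgeRepository:
    """Tiny stand-in for the knowledge repository: stores past interactions."""
    history: List[str] = field(default_factory=list)

    def remember(self, item: str) -> None:
        self.history.append(item)

    def recall(self, query: str) -> List[str]:
        return [h for h in self.history if query.lower() in h.lower()]


@dataclass
class Orchestrator:
    """Routes user input to a task-specific solver or to a general LLM call."""
    llm: Callable[[str], str]                # placeholder for a Large Language Model call
    solvers: Dict[str, Callable[[str], str]]  # task keyword -> solver
    memory: KnowledgeRepository

    def handle(self, user_input: str) -> str:
        self.memory.remember(user_input)
        for task, solver in self.solvers.items():
            if task in user_input.lower():   # naive keyword routing; the paper's routing is unspecified
                return solver(user_input)
        context = " | ".join(self.memory.recall(user_input.split()[0]))
        return self.llm(f"context: {context}\nquery: {user_input}")


# Usage with dummy components:
orchestrator = Orchestrator(
    llm=lambda prompt: f"[LLM answer to: {prompt!r}]",
    solvers={"translate": lambda text: f"[translation solver handled: {text!r}]"},
    memory=KnowledgeRepository(),
)
print(orchestrator.handle("translate buenos días to Purépecha"))
```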
IEEE Access
In recent times, there has been a growing interest in the domain of computationally challenging problem solving within both scientific and organizational contexts. This study is primarily concerned with the extraction and comprehension of the methodologies and strategies employed by individuals when confronted with intricate problems, specifically those falling under the purview of NP-hard problems. The Facility Location Problem (FLP) serves as a prominent exemplar within this study's framework. Traditionally, the handling of such complex problems has leaned upon intuitive reasoning and visual perception as the primary tools. However, these conventional approaches tend to provide only limited insight into the underlying processes employed in solving such problems. The present research seeks to bridge this knowledge gap through the utilization of advanced machine learning techniques for the purpose of categorizing and scrutinizing the strategies deployed by individuals in their attempts to tackle computationally challenging problems. The analysis conducted as part of this study unveils discernible and well-defined patterns and strategies that are employed by participants, some of whom have achieved notable levels of success. Remarkably, in certain instances, the outcomes achieved by these individuals have demonstrated a competitive edge when compared to the results produced by sophisticated computational methods, such as genetic algorithms. A fundamental component of our research methodology involves the application of heatmaps and clustering techniques. Through the normalization of results, our findings distinctly delineate two primary categories of games: those characterized by uniform player strategies and those characterized by a multitude of diverse and individualized tactics. Furthermore, our research employs a systematic approach to represent games by clustering them based on inherent similarities, utilizing cosine similarity as a metric for this purpose. By computing the averages of vectors within each cluster, we derive centroids that encapsulate the central tendencies exhibited by games belonging to that cluster. These centroids are then visually presented in a three-dimensional format, complemented by proportional spheres. These visual representations serve to vividly illustrate the dispersion and influence associated with each cluster. Our research significantly contributes to the understanding of human problem-solving strategies when confronted with computationally challenging problems. It unearths valuable insights regarding the potential for harnessing human intuition and expertise in addressing complex computational challenges. Through the integration of machine learning methodologies and intuitive visualizations, this work advances our comprehension of the approaches individuals employ to excel in solving computationally intricate problems.
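The abstract describes normalizing strategy vectors, clustering games by cosine similarity, and averaging each cluster into a centroid. The sketch below shows that generic pipeline with scikit-learn and NumPy; the random game vectors, the number of clusters, and all array names are assumptions, not the paper's data or parameters.

```python
# Generic sketch of the described pipeline: unit-normalize game vectors,
# cluster by cosine similarity, and average each cluster into a centroid.
# The data is random; the paper's real game representations are not public.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
game_vectors = rng.random((50, 16))      # 50 hypothetical games, 16 features each

unit_vectors = normalize(game_vectors)   # L2 normalization: cosine similarity == dot product

clustering = AgglomerativeClustering(
    n_clusters=2,                        # two broad categories, as in the abstract
    metric="cosine",                     # requires scikit-learn >= 1.2 (older versions use `affinity`)
    linkage="average",
)
labels = clustering.fit_predict(unit_vectors)

# Centroid of each cluster = average of its member vectors.
centroids = np.stack([unit_vectors[labels == k].mean(axis=0) for k in np.unique(labels)])
print(centroids.shape)                   # (2, 16)
```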
Skin Cancer Diagnosis Enhancement Through NLP and DNN-Based Binary Classification
Studies in fuzziness and soft computing, 2023
Virality Prediction for News Tweets Using RoBERTa
Lecture Notes in Computer Science, 2021
arXiv (Cornell University), May 27, 2023
In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The findings show that the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) affects translation performance and that indigenous languages work better when used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings. The dataset and scripts are available at https://github.com/atnafuatx/Machine-Translation-Resources.
Improved Twitter Virality Prediction using Text and RNN-LSTM
Distribute the primes into groups in ascending order, with the n-th group having prime(n) elements. Then a(n) is the sum of the numbers in the n-th group times the number of elements in the group
Distribute the primes into groups in ascending order, with the n-th group having prime(n) elements. Then a(n) is the sum of the numbers in the n-th group times the number of elements in the group.
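For concreteness, a minimal computation of the sequence exactly as defined above, using SymPy for prime generation: the first group {2, 3} gives a(1) = (2+3)·2 = 10, the second group {5, 7, 11} gives a(2) = 23·3 = 69, and so on.

```python
# Compute a(n) as defined: the n-th group of consecutive primes has prime(n)
# elements; a(n) = (sum of that group) * (size of that group).
from sympy import prime, nextprime

def a(n: int) -> int:
    p = 2                           # first prime; groups are taken in ascending order
    for k in range(1, n + 1):
        size = prime(k)             # the k-th group has prime(k) elements
        group = []
        for _ in range(size):
            group.append(p)
            p = nextprime(p)
        if k == n:
            return sum(group) * size

print([a(n) for n in range(1, 4)])  # [10, 69, 505]
```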
Authorship Attribution Through Punctuation N-Grams and Averaged Combination of SVM: Notebook for PAN at CLEF 2019
CEUR Workshop Proceedings, 2019
This work explores the exploitation of pre-processing, feature extraction, and the averaged combination of Support Vector Machine (SVM) outputs for the open-set Cross-Domain Authorship Attribution task. The use of punctuation n-grams as a feature representation of a document is introduced for Authorship Attribution in combination with traditional character n-grams. Starting from different feature representations of a document, several SVMs are trained to represent the probability of membership for a certain author, and an average of all the SVM outputs is then computed. This approach achieved a macro F1-score of 0.642 in the PAN 2019 open-set Cross-Domain Authorship Attribution contest.
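The notebook's exact pipeline is not reproduced here; the sketch below illustrates the general idea under stated assumptions: character n-gram and punctuation n-gram TF-IDF features, one probability-calibrated SVM per feature view, and a simple average of the predicted author probabilities. The documents, labels, and parameter choices are placeholders.

```python
# Illustrative two-view SVM ensemble: character n-grams + punctuation n-grams,
# with author probabilities averaged across the two classifiers.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def punctuation_only(text: str) -> str:
    """Keep only punctuation marks, so n-grams are built over punctuation sequences."""
    return re.sub(r"[\w\s]", "", text)

# Tiny placeholder corpus (the real task uses the PAN 2019 fanfiction data).
docs = ["Hello, world! How are you?", "Well... I suppose; yes, indeed.",
        "Hello!!! Again, world.", "Hmm... perhaps; or, perhaps not.",
        "Fine, fine. Really!", "Quite; although, in truth...",
        "No way! Never, ever.", "So: that is that; done."]
authors = np.array([0, 1, 0, 1, 0, 1, 0, 1])

char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
punct_vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3),
                            preprocessor=punctuation_only)

X_char = char_vec.fit_transform(docs)
X_punct = punct_vec.fit_transform(docs)

svm_char = SVC(kernel="linear", probability=True).fit(X_char, authors)
svm_punct = SVC(kernel="linear", probability=True).fit(X_punct, authors)

test = ["Hello, you!"]
probs = (svm_char.predict_proba(char_vec.transform(test)) +
         svm_punct.predict_proba(punct_vec.transform(test))) / 2
print("averaged author probabilities:", probs[0])
```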
Research in Computing Science, 2016
The growing relevance of social media for the communication strategies of public entities, such as governmental, political, commercial and non-governmental organizations, as well as public personalities, has been widely accepted in recent years. It is plausible that the lack of a method for delivering content to the general public diminishes the reach and impact of the communication strategies of any such entities. It is commonly accepted that the general public decides which contents to interact with mostly on an emotional basis; thus, such a method should be more effective if common psychological factors are taken into account. The method proposed in this paper has been developed and tested empirically working with nineteen public entities. The method has been applied to different entities, achieving improvements in content reach in every case; this suggests that the application of a method that guides the entity through this process is effective in improving the reach of the content and the public's engagement with it. Because of the empirical methodology used to develop and test it, the method is still at a very early and preliminary stage. Still, having shown promise, plans to increase its robustness include research in the areas of AI automation, neuromarketing and systems analysis, as well as the development of falsifiable tests for its application.
Virality Prediction for News Tweets Using RoBERTa
Advances in Soft Computing, 2021
OEIS, 2018
Distribute the primes into groups in ascending order, with the n-th group having prime(n) elements. Then a(n) is the sum of the numbers in the n-th group times the number of elements in the group.
PAN at CLEF, 2019
This work explores the exploitation of pre-processing, feature extraction, and the averaged combination of Support Vector Machine (SVM) outputs for the open-set Cross-Domain Authorship Attribution task. The use of punctuation n-grams as a feature representation of a document is introduced for Authorship Attribution in combination with traditional character n-grams. Starting from different feature representations of a document, several SVMs are trained to represent the probability of membership for a certain author, and an average of all the SVM outputs is then computed. This approach achieved a macro F1-score of 0.642 in the PAN 2019 open-set Cross-Domain Authorship Attribution contest.
Research In Computing Science, 2017
The growing relevance of social media for the communication strategies of public entities, including governmental, political, commercial and non-profit organizations, as well as public personalities, has been widely accepted in recent years. This paper assumes that the lack of a method for delivering content to the general public may diminish the reach and impact of the communication strategies of any such entities. It is commonly accepted that the general public decides which contents to interact with mostly on an emotional basis; thus, such a method should be more effective if common psychological factors are taken into account. The method proposed in this paper has been developed and tested empirically over the course of four years, working with over twenty NGOs, political parties, local personalities and small companies at different maturity stages, ranging from a complete lack of knowledge to highly developed teams dedicated to content creation. It covers the process starting from the creation of the entity and the setup of the necessary tools, and goes into greater detail on the specifics of content creation, including emotional titles for the content and the assumption that people engage better with the content, and eventually with the entity that delivers it, when they perceive that their own power can be increased through that interaction; finally, it defines actions to retain the public attracted by the content. The method has been applied to different entities, achieving consistent improvements, with the highest impact on the entities that had more freedom to apply the method and less initial knowledge and infrastructure. It can therefore be concluded that the initial assumptions appear valid and that the application of a method that guides the entity through the process, taking into account common psychological factors of the public, is effective in improving the public's engagement with the content produced by the entities.
NLP Indígenas, 2020
Abstract
Introduction
For a long time, indigenous languages in general, and Mexican indigenous languages in particular, have been a niche research area. Different disciplines, from linguistics, anthropology, and sociology to, less frequently, Computational Linguistics and Natural Language Processing, are traditional stakeholders in the topic. Still, the cost-benefit ratio is not favorable within computer science because of the scarcity of sources, divergence among variants of the same languages, the lack of standard orthographies, and low awareness of the topic, as well as the perceived narrowness of the space of applications and their perceived lack of impact. Thus, NLP/CL researchers have remained on the fringes.
Development
We believe that shining a different light on these apparent obstacles and deterrents can help improve the perception of their relevance and interest as research objects. On the one hand, speakers of these languages number approximately seven million in Mexico, and self-identified descendants of the Mexican indigenous cultures almost four times as many, which is not a negligible number. On the other hand, the study of low-resource NLP is interesting in itself as a research problem, but it can also help develop tools for the social sciences and receive valuable insights from them, fostering an environment of multidisciplinary studies that, in turn, lead to the preservation, inclusion, exaltation of cultural identity, and enhanced visibility of Mexican indigenous cultures. As we perceive sparks of renewed interest in low-resource languages among researchers of the NLP/CL community, we believe that an assessment of the present challenges for this area of study is pertinent. With this, we develop a systemic analysis of the field and proceed to outline a technical roadmap proposal towards the inclusion of Mexican indigenous languages in mainstream NLP/CL research. Briefly, this roadmap has two main areas that eventually converge, technical and linguistic, which coalesce through the multidisciplinary paradigm. Within these areas, several tasks are proposed, both in parallel and in succession; among the most notable are the formation of teams of linguists, social anthropologists, and computer scientists to create language-collection tools that are socially acceptable within the culture and standards of the communities. On the technical side, this includes the development of algorithms for small corpora; on the linguistic side, the development of dictionaries of equivalent orthographies for the distinctive variants of these languages.
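As a toy illustration of the "dictionaries of equivalent orthographies" idea, the sketch below maps between two well-known Nahuatl spelling conventions (classical "hu"/"qu"/"c" versus modern "w"/"k"). The rules are simplified and only indicative of the kind of resource the roadmap proposes.

```python
# Toy normalizer between two Nahuatl orthographic conventions; simplified,
# and not part of the roadmap's actual deliverables.
import re

CLASSICAL_TO_MODERN = [
    (r"hu", "w"),           # /w/: classical "hu" -> modern "w"
    (r"qu(?=[ei])", "k"),   # /k/ before e, i: "qu" -> "k"
    (r"c(?=[aou])", "k"),   # /k/ before a, o, u: "c" -> "k"
]

def normalize(word: str) -> str:
    """Map a classically spelled word to the modern convention."""
    w = word.lower()
    for pattern, repl in CLASSICAL_TO_MODERN:
        w = re.sub(pattern, repl, w)
    return w

print(normalize("nahuatl"))  # -> "nawatl"
print(normalize("calli"))    # -> "kalli" ("house")
```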
Conclusion
Even though the current panorama for indigenous languages in NLP/CL research is bleak, the re-emergence of a focus on the exaltation and preservation of originary cultures and the push for multidisciplinary studies create a fertile environment for new approaches towards the inclusion of these languages in mainstream NLP/CL research. The development of applied research in this area can build bridges for the inclusion of indigenous communities in nationwide social and political conversations from which they are currently excluded by the digital divide. It can also have a tremendous impact on the preservation of past and present versions of originary languages, as well as increasing the visibility of these cultures.
PLN Indígenas, 2020
The availability of lexical resources is a cornerstone of endangered-language preservation and documentation, and they also constitute a primary source for language teaching and revitalization. For instance, Mexico has around 70 indigenous languages and XX variations spoken by ~7 million people, which, despite their cultural importance, lack digital presence, have poor data quality, and face language extinction. To confront these circumstances we used text-mining approaches to collect and transform existing lexical resources into language-learning resources for four endangered languages of Mexico. Finally, we present an application for such learning resources using Anki, an open-source and multi-platform flashcard application.
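The abstract does not name the deck-building tooling; one common way to generate Anki decks programmatically is the genanki library, used here as an assumption, with a hypothetical tab-separated lexicon file rather than real lexical data.

```python
# Hedged sketch: packaging a bilingual word list into an Anki deck with genanki
# (an assumption; the paper does not specify its tooling). "lexicon.tsv" is a
# hypothetical file with one "<headword>\t<Spanish gloss>" entry per line.
import csv
import genanki

model = genanki.Model(
    1607392319,                      # arbitrary fixed model id
    "Bilingual vocabulary card",
    fields=[{"name": "Headword"}, {"name": "Gloss"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Headword}}",
        "afmt": "{{FrontSide}}<hr id='answer'>{{Gloss}}",
    }],
)

deck = genanki.Deck(2059400110, "Endangered-language vocabulary")  # arbitrary deck id

with open("lexicon.tsv", encoding="utf-8") as f:
    for headword, gloss in csv.reader(f, delimiter="\t"):
        deck.add_note(genanki.Note(model=model, fields=[headword, gloss]))

genanki.Package(deck).write_to_file("vocabulary.apkg")  # importable into Anki
```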
PLN Indígenas, 2021
Traditionally, more than 7 million people distributed across 68 linguistic groups and approximately 364 linguistic variants have been excluded from technological development. This effect creates a digital divide that, in turn, produces greater inequality and a lack of access to opportunities for people belonging to the originary peoples of Mexico. Speakers of indigenous languages who strive to use modern technologies, such as social media, are forced onto platforms that are at odds with their worldview and on which they struggle to use their own languages. Moreover, the indigenous communities that do gain access face discrimination, so they end up isolated and forced to silence their originary cultures and languages in order to achieve an incipient integration that is hardly reciprocal and is unfavorable in ways ranging from the sociolinguistic to the technological, since functionality and user-experience designs are conceived from the majority languages and cultures.
Canadian Science Policy Centre Editorials, 2021
In a world where Artificial Intelligence has advanced so much, so fast, no single human should be excluded for not speaking a dominant language. As of today, most advances in AI language research have been made for a handful of dominant languages, excluding almost completely the thousands of minority and endangered languages of the world and deepening the digital divide at an ever-accelerating rate.