Christian E. Maldonado-Sifuentes | Consejo Nacional de Ciencia y Tecnología
Uploads
Drafts by Christian E. Maldonado-Sifuentes
Lecture Notes in Computer Science, 2023
This paper offers a comparative analysis of two state-of-the-art machine translation models for Spanish to Indigenous languages of Colombia and Mexico, with the aim of investigating their effectiveness and limitations under low-resource conditions. Our methodology involved aligning verse-pair text from the Bible for twelve Indigenous languages and constructing parallel datasets for evaluation using BLEU and ROUGE metrics. The results demonstrate that transformer-based models can deliver competitive performance in translating from Spanish to Indigenous languages with minimal configuration. In particular, we found that the Opus-based model obtained the best performance in 11 of the languages in the test set, but the Fairseq model performs competitively in scenarios where training data is scarcer. Additionally, we provide a comprehensive analysis of the findings, including insights into the strengths and limitations of the models. Finally, we suggest potential directions for future research in low-resource language translation, specifically in the context of Latin American indigenous languages.
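The verse-alignment step mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the (book, chapter, verse) key layout, the function name `align_verses`, and the placeholder strings are all assumptions made for illustration. The idea is simply that two Bible translations can be paired on shared verse references to produce a parallel dataset.

```python
from typing import Dict, List, Tuple

# A verse reference such as ("GEN", 1, 1); this key layout is an assumption.
VerseRef = Tuple[str, int, int]


def align_verses(
    source: Dict[VerseRef, str],
    target: Dict[VerseRef, str],
) -> List[Tuple[VerseRef, str, str]]:
    """Pair source and target verses that share a reference and are non-empty."""
    pairs = []
    # Only references present in both translations can form a parallel pair.
    for ref in sorted(source.keys() & target.keys()):
        src, tgt = source[ref].strip(), target[ref].strip()
        if src and tgt:  # skip verses that are blank in either translation
            pairs.append((ref, src, tgt))
    return pairs


# Placeholder strings stand in for real verse text (hypothetical data).
spanish = {
    ("GEN", 1, 1): "es-verse-1",
    ("GEN", 1, 2): "es-verse-2",
    ("GEN", 1, 3): "es-verse-3",
}
indigenous = {
    ("GEN", 1, 1): "tgt-verse-1",
    ("GEN", 1, 3): "tgt-verse-3",
    ("GEN", 1, 4): "tgt-verse-4",
}

parallel = align_verses(spanish, indigenous)
for ref, src, tgt in parallel:
    print(ref, src, tgt)
```

Verses missing from either side are dropped rather than padded, so the resulting pairs can be fed directly to a translation model or to an evaluation metric.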
Account:
● follo[w]ers
● follow[i]ng (friends)
● lis[t]ed
Influence of account: $\mathrm{inf}_{\mathrm{acct}} = w + w \cdot ((\ln(w/i))\,100) + 10t$
Tweet (original, no RT):
● [f]avorite
● [r]etweeted
Influence of tweet: $\mathrm{inf}_{\mathrm{twt}} = r + f + r \cdot ((\ln(f/r))\,100) + f \cdot ((\ln(f/r))\,100)$
Absolute tweet influence (independent of account): $\mathrm{inf}_{\mathrm{abs}} = \mathrm{inf}_{\mathrm{twt}} / \mathrm{inf}_{\mathrm{acct}}$
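The influence formulas above can be sketched in Python as follows. One reading is assumed and should be checked against the original draft: the 100 written next to each logarithm is treated as a multiplier, exposed here as the hypothetical parameter `scale`. The function names and the guards against non-positive counts (where the logarithm is undefined) are also additions for illustration.

```python
import math


def account_influence(w: int, i: int, t: int, scale: float = 100.0) -> float:
    """inf_acct = w + w * (ln(w / i) * scale) + 10 * t.

    w: followers, i: following (friends), t: listed.
    Assumption: the juxtaposed 100 in the source is a multiplier (scale).
    """
    if w <= 0 or i <= 0:
        raise ValueError("followers and following must be positive for ln(w / i)")
    return w + w * (math.log(w / i) * scale) + 10 * t


def tweet_influence(f: int, r: int, scale: float = 100.0) -> float:
    """inf_twt = r + f + r * (ln(f / r) * scale) + f * (ln(f / r) * scale).

    f: favorites, r: retweets of an original (non-RT) tweet.
    """
    if f <= 0 or r <= 0:
        raise ValueError("favorites and retweets must be positive for ln(f / r)")
    log_term = math.log(f / r) * scale
    return r + f + r * log_term + f * log_term


def absolute_tweet_influence(
    f: int, r: int, w: int, i: int, t: int, scale: float = 100.0
) -> float:
    """inf_abs = inf_twt / inf_acct (tweet influence relative to the account)."""
    return tweet_influence(f, r, scale) / account_influence(w, i, t, scale)


# Example: an account with equal followers and following, listed 5 times.
print(account_influence(100, 100, 5))
print(tweet_influence(10, 10))
print(absolute_tweet_influence(10, 10, 100, 100, 5))
```

When followers equal following, or favorites equal retweets, the logarithmic term vanishes and only the raw counts remain, which makes those cases convenient sanity checks.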
Papers by Christian E. Maldonado-Sifuentes
Polibits, 2024
In the present manuscript, we introduce a novel and holistic architecture for the ProtoAGI system, conceptualized from a systems engineering standpoint. This architecture is elaborately crafted to emulate artificial general intelligence (AGI) through the integration of diverse components and knowledge frameworks, thereby augmenting its performance and adaptability. We meticulously delineate the system's proficiency in processing intricate user inputs, its capacity for adaptive learning from historical datasets, and its ability to generate responses that are contextually relevant. The cornerstone of our proposition is the intricate orchestration of Large Language Models (LLMs), task-specific solvers, and a comprehensive knowledge repository, which collectively propel the system towards achieving genuine adaptability and autonomous learning capabilities. This approach not only signifies a pioneering venture into the realm of AGI system design but also lays the groundwork for subsequent advancements in this field.
Studies in fuzziness and soft computing, 2023
Lecture Notes in Computer Science, 2021
arXiv (Cornell University), May 27, 2023
Distribute the primes into groups in ascending order, with the n-th group having prime(n) elements. Then a(n) is the sum of the numbers in the n-th group times the number of elements in the group.
CEUR Workshop Proceedings, 2019
This work explores the use of pre-processing, feature extraction, and the averaged combination of Support Vector Machine (SVM) outputs for the open-set Cross-Domain Authorship Attribution task. Punctuation n-grams are introduced as a feature representation of a document for Authorship Attribution, in combination with traditional character n-grams. Starting from different feature representations of a document, several SVMs are trained to estimate the probability that a document belongs to a certain author, and their results are then averaged. This approach obtained a Macro F1-score of 0.642 in the PAN 2019 contest on open-set Cross-Domain Authorship Attribution.
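The two feature representations and the averaging step named in this abstract can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the authors' system: the abstract does not define punctuation n-grams precisely, so here they are assumed to be n-grams over the sequence of punctuation marks alone; the SVM training is omitted, and the per-author probabilities are hypothetical placeholders. All function names are illustrative.

```python
import string
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams of length n."""
    return Counter(text[k:k + n] for k in range(len(text) - n + 1))


def punctuation_ngrams(text: str, n: int) -> Counter:
    """Count n-grams over the punctuation marks only, in document order.

    Assumption: all non-punctuation characters are discarded first.
    """
    marks = "".join(ch for ch in text if ch in string.punctuation)
    return char_ngrams(marks, n)


def average_probabilities(outputs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the per-author probabilities produced by several classifiers."""
    authors = outputs[0].keys()
    return {a: sum(o[a] for o in outputs) / len(outputs) for a in authors}


sample = "Well, well... who is it?"
print(char_ngrams("abab", 2))
print(punctuation_ngrams(sample, 2))

# Hypothetical outputs of two classifiers, one per feature representation.
combined = average_probabilities([
    {"author_A": 0.6, "author_B": 0.4},
    {"author_A": 0.2, "author_B": 0.8},
])
print(max(combined, key=combined.get))
```

Averaging lets a representation that is confident about an author outweigh one that is undecided, without requiring the classifiers to share a feature space.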
Research in Computing Science, 2016
Advances in Soft Computing, 2021
OEIS, 2018
Distribute the primes into groups in ascending order, with the n-th group having prime(n) elements. Then a(n) is the sum of the numbers in the n-th group times the number of elements in the group.
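The definition above can be checked with a short Python sketch. The function names are illustrative, and the simple trial-division prime generator is chosen for readability rather than speed. For example, the first group is {2, 3}, giving a(1) = (2 + 3) * 2 = 10, and the second is {5, 7, 11}, giving a(2) = 23 * 3 = 69.

```python
from itertools import count, islice
from typing import Iterator, List


def primes() -> Iterator[int]:
    """Yield primes in ascending order using trial division by earlier primes."""
    found: List[int] = []
    for candidate in count(2):
        # A candidate is prime if no earlier prime up to its square root divides it.
        if all(candidate % p for p in found if p * p <= candidate):
            found.append(candidate)
            yield candidate


def grouped_prime_terms(n_terms: int) -> List[int]:
    """Return a(1), ..., a(n_terms) for the grouping described in the text."""
    sizes = list(islice(primes(), n_terms))  # group n has prime(n) elements
    stream = primes()                        # primes consumed group by group
    terms = []
    for size in sizes:
        group = list(islice(stream, size))
        terms.append(sum(group) * size)      # group sum times group size
    return terms


print(grouped_prime_terms(4))  # [10, 69, 505, 2177]
```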
PAN at CLEF, 2019
This work explores the use of pre-processing, feature extraction, and the averaged combination of Support Vector Machine (SVM) outputs for the open-set Cross-Domain Authorship Attribution task. Punctuation n-grams are introduced as a feature representation of a document for Authorship Attribution, in combination with traditional character n-grams. Starting from different feature representations of a document, several SVMs are trained to estimate the probability that a document belongs to a certain author, and their results are then averaged. This approach obtained a Macro F1-score of 0.642 in the PAN 2019 contest on open-set Cross-Domain Authorship Attribution.
Research In Computing Science, 2017
The growing relevance of social media for the communication strategies of public entities (including governmental, political, commercial and non-profit organizations, as well as personalities) has been widely accepted in recent years. This paper assumes that the lack of a method for delivering content to the general public may diminish the reach and impact of the communication strategies of any such entity. It is commonly accepted that the general public decides which content to interact with mostly on an emotional basis. Thus, such a method should be more effective if common psychological factors are taken into account. The method proposed in this paper has been developed and tested empirically over the course of four years, working with over twenty NGOs, political parties, local personalities and small companies at different maturity stages, ranging from a complete lack of knowledge to highly developed teams dedicated to content creation. It covers the process from the creation of the entity and the setup of the necessary tools, goes into greater detail on the specifics of content creation, including emotional titles for the content and the assumption that people engage better with the content (and eventually with the entity that delivers it) when they perceive that their own power can be increased through that interaction, and finally defines actions to retain the public attracted by the content. The method has been applied to different entities with consistent improvements, with the highest impact on entities that had more freedom to apply the method and less initial knowledge and infrastructure.
Therefore, it can be concluded that the initial assumptions appear to be valid and that applying a method that guides the entity through the process, taking into account common psychological factors of the public, is effective in improving the public's engagement with the content produced by the entities.
NLP Indígenas, 2020
Abstract
Introduction
For a long time, indigenous languages in general, and Mexican ones in particular, have been a niche research area. Different disciplines, from linguistics, anthropology, and sociology to (to a lesser extent) Computational Linguistics and Natural Language Processing, are traditional stakeholders in the topic. Still, the cost-benefit ratio is not favorable in the realm of computer science because of the scarcity of sources, divergence among variants of the same languages, lack of standard orthographies, and low topic awareness, as well as the perceived narrowness of applications and their lack of impact. Thus, NLP/CL researchers have remained on the fringes.
Development
We believe that shining a different light on these apparent obstacles and deterrents can help to improve the perception of these languages' relevance and interest as research objects. On the one hand, speakers of these languages amount to approximately seven million in Mexico, and self-identified descendants of the Mexican indigenous cultures to almost four times as many, not a negligible number. On the other hand, the study of low-resource NLP is interesting in itself as a research problem, but it can also help develop tools for the social sciences and receive valuable insights from them, fostering an environment of multidisciplinary studies that, in turn, leads to the preservation, inclusion, cultural identity exaltation, and enhanced visibility of Mexican indigenous cultures. As we perceive sparks of renewed interest in low-resource languages among researchers in the NLP/CL community, we believe that an assessment of the present challenges for the area of study is pertinent. With this, we develop a systemic analysis of the field and outline a proposed technical roadmap towards the inclusion of Mexican indigenous languages in mainstream NLP/CL research. Briefly, this roadmap has two main areas, technical and linguistic, which eventually converge through the multidisciplinary paradigm. Within these areas, several tasks are proposed both in parallel and in succession. Among the most notable is the formation of teams of linguists, social anthropologists, and computer scientists to create language-collection tools that are socially acceptable within the culture and standards of the communities. On the technical side, we propose the development of algorithms for small corpora; on the linguistic side, the development of dictionaries of equivalent orthographies for distinct variants of these languages.
Conclusion
Even though the current panorama for indigenous languages in NLP/CL research is bleak, the re-emergence of the focus on the exaltation and preservation of originary cultures and the push for multidisciplinary studies create a fertile environment for new approaches towards the inclusion of these languages in mainstream NLP/CL research. Applied research in this area can build bridges for the inclusion of indigenous communities in nationwide social and political conversations from which they are now excluded by the digital divide. It can also have a tremendous impact on the preservation of past and present versions of originary languages, as well as on the visibility of these cultures.
PLN Indígenas, 2020
The availability of lexical resources is a cornerstone for the preservation and documentation of endangered languages; such resources also constitute a primary source for language teaching and revitalization. For instance, Mexico has around 70 indigenous languages and XX variations spoken by ~7 million people, which, despite their cultural importance, lack digital presence, have poor data quality, and face language extinction. To confront these circumstances, we used text mining approaches to collect and transform existing lexical resources into language-learning resources for four endangered languages of Mexico. Finally, we present an application for such learning resources using Anki, an open-source and multi-platform
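One way to picture the transformation of lexical resources into Anki material is the Python sketch below. It does not reproduce the authors' tooling, which the abstract does not describe. It relies only on the fact that Anki can import tab-separated text files, and it assumes a hypothetical lexicon of (headword, gloss) pairs; the function name and the placeholder entries are illustrative.

```python
import csv
import io
from typing import Iterable, Tuple


def to_anki_tsv(entries: Iterable[Tuple[str, str]]) -> str:
    """Render (front, back) pairs as tab-separated text, one note per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for front, back in entries:
        front, back = front.strip(), back.strip()
        if front and back:  # skip incomplete lexicon entries
            writer.writerow([front, back])
    return buffer.getvalue()


# Placeholder entries; a real lexicon would supply headwords and glosses.
lexicon = [
    ("headword-1", "gloss-1"),
    ("headword-2", ""),        # incomplete entry, dropped on export
    ("headword-3", "gloss-3"),
]

tsv_text = to_anki_tsv(lexicon)
print(tsv_text, end="")
```

The resulting text can be saved to a file and loaded through Anki's text import, mapping the first column to the front of the card and the second to the back.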
Conference Presentations by Christian E. Maldonado-Sifuentes
PLN Indígenas, 2021
Traditionally, more than 7 million people, distributed across 68 linguistic groups and approximately 364 linguistic variants, have been excluded from technological development. This exclusion creates a digital divide that, in turn, produces greater inequality and a lack of access to opportunities for people belonging to the indigenous peoples of Mexico. Speakers of indigenous languages who struggle to use modern technologies, such as social networks, are forced to use platforms that are at odds with their worldview and on which they find it difficult to use their own languages. Moreover, indigenous communities that do gain access face discrimination, and so end up isolated and forced to silence their original cultures and languages in exchange for an integration that is incipient, hardly reciprocal, and unfavorable in ways ranging from the sociolinguistic to the technological, since features and user experience are designed from the standpoint of the majority languages and cultures.
Talks by Christian E. Maldonado-Sifuentes
Canadian Science Policy Centre Editorials, 2021
In a world where Artificial Intelligence has advanced so much, so fast, no single human should be excluded for not speaking a dominant language. As of today, most of the advances in AI language research have been made for a handful of dominant languages, almost completely excluding the thousands of minority and endangered languages of the world and deepening the digital divide at an ever-accelerating rate.