Cross-Lingual NLP and Text Mining Resources (original) (raw)

Introduction

Most natural language processing and text mining systems have mainly been designed to operate on English language text. Yet, the majority of people online do not speak English, and as more and more people gain access to digital tools, there is an increasing need for tools that operate on other languages.

In some cases, it is feasible to develop new customized systems for specific populations. E.g., there are many tools that have been developed for te Chinese market. However, it is challenging to achieve this for all of the world's over 7,000 languages.

Self-Learning with Adversarial Perturbation for Cross-Lingual NLP

Cross-lingual NLP and text mining technology is an important alternative in such situations. The idea is to develop a system based on one or more languages for which many resources are available, yet be able to use it in many new languages by connecting their linguistic representations.

Induced Code-Switching Datasets

We are releasing the cross-lingual benchmark database from our SIGIR 2020 paper. In this benchmark, a text classification model needs to be trained on regular English training data, but is evaluated on documents that contain an automatically induced form of code-switching, i.e., many of the original English words have been replaced by non-English words.

Example:

Clorox Co disait it déclarés a two-for-one stock scission et autorisée a 10.3 pourcentage accroissement in la annuel dividende rate. The société disait la trimestriels liquidités dividende était boosted to 0.64perpartagezdepuis0.64 per partagez depuis 0.64perpartagezdepuis0.58 on a pre-split basis, exigibles August 15 to actionnariat of enregistrer on July 28. The société disait ce sera be la 21st consécutive annuel accroissement in la dividend. The additionnel partages résultant depuis la scission sera be distribuées on September 2 to actionnariat of enregistrer on July 28, la société said. Clorox is a fabricant of ménage épicerie produits et automobile nettoyage produits commercialisés in la United States et internationally.