Language-agnostic BERT Sentence Embedding
Abstract
The study investigates effective methods for learning multilingual sentence embeddings using a combination of monolingual and cross-lingual techniques, achieving high bi-text retrieval accuracy across numerous languages.
While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding-based transfer learning (Reimers and Gurevych, 2019), BERT-based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations, including: masked language modeling (MLM), translation language modeling (TLM) (Conneau and Lample, 2019), dual encoder translation ranking (Guo et al., 2018), and additive margin softmax (Yang et al., 2019a). We show that introducing a pre-trained multilingual language model reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by Artetxe and Schwenk (2019b), while still performing competitively on monolingual transfer learning benchmarks (Conneau and Kiela, 2018). Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.
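The core training objective combines dual-encoder translation ranking with additive margin softmax: within a batch, each source sentence must rank its true translation above all other targets, and a margin is subtracted from the similarity of the true pair to make the task harder. A minimal NumPy sketch of this loss (the function name, `margin`, and `scale` values are illustrative, not the paper's exact hyperparameters):

```python
import numpy as np

def additive_margin_ranking_loss(src, tgt, margin=0.3, scale=1.0):
    """Dual-encoder translation-ranking loss with additive margin (sketch).

    src, tgt: (batch, dim) sentence embeddings; row i of src is assumed to be
    a translation of row i of tgt, and all other rows act as in-batch negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                          # (batch, batch) similarity matrix
    sim[np.diag_indices_from(sim)] -= margin   # penalize only the positive pairs
    logits = scale * sim
    # Numerically stable softmax cross-entropy; the matching translation
    # (the diagonal entry) is the correct class for each row.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because the margin is applied only to the true pair, the loss stays above the plain-softmax value even for a perfect encoder, which pushes translations further apart from hard negatives during training.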
Get this paper in your agent:
hf papers read 2007.01852
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 8
cointegrated/LaBSE-en-ru Feature Extraction • 0.1B • Updated Mar 28, 2024 • 7.79k • 59
setu4993/LaBSE Sentence Similarity • 0.5B • Updated Oct 19, 2023 • 12.7k • 54
Blaxzter/LaBSE-sentence-embeddings Sentence Similarity • 0.5B • Updated May 4, 2023 • 33 • 19
setu4993/smaller-LaBSE Sentence Similarity • 0.2B • Updated Oct 19, 2023 • 493 • 13
Datasets citing this paper 4
ai-forever/MERA Viewer • Updated Sep 24, 2024• 85k • 670 • 19
csebuetnlp/squad_bn Updated Sep 10, 2024 • 80 • 6
csebuetnlp/xnli_bn Viewer • Updated Aug 21, 2022• 389k • 77 • 3
csebuetnlp/dailydialogue_bn Viewer • Updated Jul 22, 2023• 13.1k • 21 • 5