TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning

Published on Apr 14, 2021

Abstract

TSDAE, a combination of pre-trained Transformers and Sequential Denoising Auto-Encoders, achieves high performance in unsupervised sentence embeddings and excels in domain adaptation and pre-training, surpassing Masked Language Models.

Learning sentence embeddings often requires a large amount of labeled data. However, for most tasks and domains, labeled data is seldom available and creating it is expensive. In this work, we present a new state-of-the-art unsupervised method based on pre-trained Transformers and Sequential Denoising Auto-Encoder (TSDAE) which outperforms previous approaches by up to 6.4 points. It can achieve up to 93.1% of the performance of in-domain supervised approaches. Further, we show that TSDAE is a strong domain adaptation and pre-training method for sentence embeddings, significantly outperforming other approaches like Masked Language Model. A crucial shortcoming of previous studies is the narrow evaluation: most work mainly evaluates on the single task of Semantic Textual Similarity (STS), which does not require any domain knowledge. It is unclear if these proposed methods generalize to other domains and tasks. We fill this gap and evaluate TSDAE and other recent approaches on four different datasets from heterogeneous domains.
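TSDAE trains an encoder to produce a fixed-size sentence embedding from a corrupted input sentence, and a decoder to reconstruct the original sentence from that embedding alone. Below is a minimal training sketch using the sentence-transformers library (maintained by the paper's authors), assuming a version that ships DenoisingAutoEncoderDataset (which applies token-deletion noise on the fly) and DenoisingAutoEncoderLoss; the placeholder sentences and hyperparameters are illustrative, not the paper's exact settings.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# Encoder: a pre-trained Transformer with CLS pooling to get one vector per sentence
model_name = "bert-base-uncased"
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Unlabeled, in-domain sentences (hypothetical placeholders; use thousands in practice)
train_sentences = [
    "A sentence drawn from the target domain.",
    "Another unlabeled sentence from the same corpus.",
]

# Wrapper that corrupts each sentence (deletion noise) and pairs it with the original
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Decoder reconstructs the original sentence from the sentence embedding;
# tying encoder and decoder weights follows the setup recommended in the paper
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)

# At inference time the decoder is discarded; only the encoder produces embeddings
embeddings = model.encode(["A query sentence to embed."])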


Get this paper in your agent:

hf papers read 2104.06979

Don't have the latest CLI? Install it with:

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper (50)

llmware/industry-bert-contracts-v0.1 Feature Extraction • Updated May 14, 2024 • 32.2k downloads • 17 likes

llmware/industry-bert-insurance-v0.1 Feature Extraction • Updated May 14, 2024 • 280 downloads • 11 likes

llmware/industry-bert-sec-v0.1 Feature Extraction • Updated May 14, 2024 • 271 downloads • 9 likes

llmware/industry-bert-asset-management-v0.1 Feature Extraction • Updated May 14, 2024 • 6 downloads • 7 likes

Browse all 50 models citing this paper

Datasets citing this paper (1)

mteb/AskUbuntuDupQuestions Viewer • Updated May 4, 2025 • 15.2k • 2.48k

Spaces citing this paper (2)

Collections including this paper (2)