Vision-Language Pre-Training with Triple Contrastive Learning
Related papers
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
ArXiv, 2021
Contrastive Visual-Linguistic Pretraining
ArXiv, 2020
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training
Proceedings of the AAAI Conference on Artificial Intelligence
FILIP: Fine-grained Interactive Language-Image Pre-Training
ArXiv, 2021
VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing
ArXiv, 2021
A Survey of Vision-Language Pre-Trained Models
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Adaptive Cross-Modal Embeddings for Image-Text Alignment
Proceedings of the AAAI Conference on Artificial Intelligence, 2020
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
ArXiv, 2022
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
ArXiv, 2023
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
ArXiv, 2020
cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation
ArXiv, 2022
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders
ArXiv, 2020
Unifying Vision-Language Representation Space with Single-Tower Transformer
Proceedings of the AAAI Conference on Artificial Intelligence
Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching
ArXiv, 2023
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
2017 IEEE International Conference on Computer Vision (ICCV), 2017
Learning to Scale Multilingual Representations for Vision-Language Tasks
Computer Vision – ECCV 2020, 2020
Language Features Matter: Effective Language Representations for Vision-Language Tasks
2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019
Cross-Modality Relevance for Reasoning on Language and Vision
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions
2020
LAION-5B: An open large-scale dataset for training next generation image-text models
ArXiv, 2022
Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)
Cross-lingual Visual Pre-training for Multimodal Machine Translation
ArXiv, 2021
VLDeformer: Vision–Language Decomposed Transformer for fast cross-modal retrieval
Knowledge-Based Systems
Improving the Cross-Lingual Generalisation in Visual Question Answering
ArXiv, 2022
Cross-Modal Common Representation Learning with Triplet Loss Functions
ArXiv, 2024
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
2022
Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval
ArXiv, 2020
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering
IEEE Transactions on Neural Networks and Learning Systems, 2022
Cross-modal learning with prior visual relation knowledge
Knowledge-Based Systems, 2020
Multimodal Convolutional Neural Networks for Matching Image and Sentence
2015 IEEE International Conference on Computer Vision (ICCV), 2015
Learning Fused Representations for Large-Scale Multimodal Classification
IEEE Sensors Letters, 2018