Vision-Language Pre-Training with Triple Contrastive Learning

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Jingyuan Wen

arXiv, 2021

Contrastive Visual-Linguistic Pretraining

Zhengkai Jiang

arXiv, 2020

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training

Yuejian Fang

Proceedings of the AAAI Conference on Artificial Intelligence

FILIP: Fine-grained Interactive Language-Image Pre-Training

Zhenguo Li

arXiv, 2021

VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing

Yimeng Deng

arXiv, 2021

A Survey of Vision-Language Pre-Trained Models

Yifan Du

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Adaptive Cross-Modal Embeddings for Image-Text Alignment

Camila Kolling

Proceedings of the AAAI Conference on Artificial Intelligence, 2020

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Owais Mohammed

arXiv, 2022

LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

Mohammad Abuzar Shaikh

arXiv, 2021

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

Sơn Trần

arXiv, 2023

Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

Zhecan Wang

arXiv, 2020

cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Devansh Gautam

arXiv, 2022

XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding

Yu Tsao

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Fabrizio Falchi

arXiv, 2020

Unifying Vision-Language Representation Space with Single-Tower Transformer

Chaerin Kong

Proceedings of the AAAI Conference on Artificial Intelligence

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching

Dasol Hwang

arXiv, 2023

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

Tanmay Gupta

2017 IEEE International Conference on Computer Vision (ICCV), 2017

Learning to Scale Multilingual Representations for Vision-Language Tasks

Derry Wijaya

Computer Vision – ECCV 2020, 2020

Language Features Matter: Effective Language Representations for Vision-Language Tasks

Reuben Tan

2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019

Cross-Modality Relevance for Reasoning on Language and Vision

Parisa Kordjamshidi

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020

VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining

Mohamed Elhoseiny

arXiv, 2021

Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions

Zhecan Wang

2020

LAION-5B: An open large-scale dataset for training next generation image-text models

Romain Beaumont

arXiv, 2022

Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks

Zhecan Wang

Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)

Cross-lingual Visual Pre-training for Multimodal Machine Translation

Menekse Kuyu

arXiv, 2021

VLDeformer: Vision–Language Decomposed Transformer for fast cross-modal retrieval

Yimeng Deng

Knowledge-Based Systems

Improving the Cross-Lingual Generalisation in Visual Question Answering

Farhad Nooralahzadeh

arXiv, 2022

Cross-Modal Common Representation Learning with Triplet Loss Functions

Felix Ott

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Mohammad Hossein Sekhavat

arXiv, 2024

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Simran Tiwari

2022

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

Parvin Razzaghi

arXiv, 2020

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Jianbing Shen

IEEE Transactions on Neural Networks and Learning Systems, 2022

Cross-modal learning with prior visual relation knowledge

Zengchang Qin

Knowledge-Based Systems, 2020

Multimodal Convolutional Neural Networks for Matching Image and Sentence

Hang Li

2015 IEEE International Conference on Computer Vision (ICCV), 2015

Learning Fused Representations for Large-Scale Multimodal Classification

Shah Nawaz Bhutto

IEEE Sensors Letters, 2018
