Asai, A., Min, S., Zhong, Z., Chen, D.: Retrieval-based language models and applications. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pp. 41–46 (2023)
Asai, A., et al.: Task-aware retrieval with instructions. In: Findings of the Association for Computational Linguistics: ACL 2023 (2023)
Blattmann, A., Rombach, R., Oktay, K., Müller, J., Ommer, B.: Semi-parametric neural image synthesis. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402 (2023)
Chang, Y., Narang, M., Suzuki, H., Cao, G., Gao, J., Bisk, Y.: WebQA: multihop and multimodal QA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16495–16504 (2022)
Changpinyo, S., Pont-Tuset, J., Ferrari, V., Soricut, R.: Telling the what while pointing to the where: multimodal queries for image retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12136–12146 (2021)
Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-Imagen: retrieval-augmented text-to-image generator. In: The International Conference on Learning Representations (2022)
Chen, Y., et al.: Can pre-trained vision and language models answer visual information-seeking questions? In: Proceedings of Conference on Empirical Methods in Natural Language Processing (2023)
Chen, Y.C., et al.: UNITER: learning universal image-text representations. arXiv preprint arXiv:1909.11740 (2019)
Chowdhery, A., et al.: PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)
Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: Advances in Neural Information Processing Systems (2023)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: The International Conference on Learning Representations (2020)
Fu, S., et al.: DreamSim: learning new dimensions of human visual similarity using synthetic data. In: Advances in Neural Information Processing Systems (2023)
Ge, Y., et al.: Making LLaMA SEE and draw with SEED tokenizer. arXiv preprint arXiv:2310.01218 (2023)
Girdhar, R., et al.: ImageBind: one embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190 (2023)
Han, X., et al.: Automatic spatially-aware fashion concept discovery. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Hu, H., et al.: Instruct-Imagen: image generation with multi-modal instruction. arXiv preprint arXiv:2401.01952 (2024)
Hu, H., et al.: Open-domain visual entity recognition: towards recognizing millions of Wikipedia entities. In: Proceedings of the IEEE International Conference on Computer Vision (2023)
Jain, A., et al.: MURAL: multimodal, multitask retrieval across languages. In: Findings of the Association for Computational Linguistics: EMNLP (2021)
Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(3), 535–547 (2019)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583–5594. PMLR (2021)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning (2023)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 34, 9694–9705 (2021)
Liu, F., Wang, Y., Wang, T., Ordonez, V.: Visual news: benchmark and challenges in news image captioning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (2023)
Liu, S., Feng, W., Chen, W., Wang, W.Y.: EDIS: entity-driven image search over multimodal web content. In: Proceedings of Conference on Empirical Methods in Natural Language Processing (2023)
Liu, Z., Xiong, C., Lv, Y., Liu, Z., Yu, G.: Universal vision-language dense retrieval: learning a unified representation space for multi-modal retrieval. In: The Eleventh International Conference on Learning Representations (2023)
Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2125–2134 (2021)
Luo, M., Fang, Z., Gokhale, T., Yang, Y., Baral, C.: End-to-end knowledge retrieval with multi-modal queries. In: Annual Meeting of the Association for Computational Linguistics (2023)
Mishra, S., Khashabi, D., Baral, C., Hajishirzi, H.: Cross-task generalization via natural language crowdsourcing instructions. In: Annual Meeting of the Association for Computational Linguistics (2021)
Ouyang, L., et al.: Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022)
Pan, X., Dong, L., Huang, S., Peng, Z., Chen, W., Wei, F.: Kosmos-G: generating images in context with multimodal large language models. arXiv preprint arXiv:2310.02992 (2023)
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649 (2015)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Ram, O., et al.: In-context retrieval-augmented language models. Trans. Assoc. Comput. Linguist. 11, 1316–1331 (2023)
Sheynin, S., et al.: kNN-diffusion: image generation via large-scale retrieval. In: The Eleventh International Conference on Learning Representations (2023)
Singhal, A., et al.: Modern information retrieval: a brief overview. IEEE Data Eng. Bull. 24(4), 35–43 (2001)
Su, H., et al.: One embedder, any task: instruction-finetuned text embeddings. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Findings of the Association for Computational Linguistics: ACL 2023 (2023)
Sun, Q., et al.: Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286 (2023)
Tang, Z., Yang, Z., Zhu, C., Zeng, M., Bansal, M.: Any-to-any generation via composable diffusion. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Gemini Team: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., Gurevych, I.: BEIR: a heterogeneous benchmark for zero-shot evaluation of information retrieval models. In: Advances in Neural Information Processing Systems (2021)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Wang, B., et al.: InstructRetro: instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713 (2023)
Wei, J., et al.: Finetuned language models are zero-shot learners. In: The International Conference on Learning Representations (2021)
Wu, H., et al.: Fashion IQ: a new dataset towards retrieving images by natural language feedback. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11307–11317 (2021)
Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: NExT-GPT: any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519 (2023)
Xu, Z., Shen, Y., Huang, L.: MultiInstruct: improving multi-modal zero-shot learning via instruction tuning. In: Annual Meeting of the Association for Computational Linguistics (2023)
Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.C., Liu, Z., Wang, L.: The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)
Yasunaga, M., et al.: Retrieval-augmented multimodal language modeling. In: International Conference on Machine Learning (2023)
Yu, L., et al.: Scaling autoregressive multi-modal models: pretraining and instruction tuning. arXiv preprint arXiv:2309.02591 (2023)
Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE International Conference on Computer Vision (2023)
Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: a manually annotated dataset for instruction-guided image editing. In: Advances in Neural Information Processing Systems (2023)