UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

References

  1. Achiam, J., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. Asai, A., Min, S., Zhong, Z., Chen, D.: Retrieval-based language models and applications. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pp. 41–46 (2023)
  3. Asai, A., et al.: Task-aware retrieval with instructions. In: Findings of the Association for Computational Linguistics: ACL 2023 (2023)
  4. Blattmann, A., Rombach, R., Oktay, K., Müller, J., Ommer, B.: Semi-parametric neural image synthesis. In: Oh, A.H., Agarwal, A., Belgrave, D., Cho, K. (eds.) Advances in Neural Information Processing Systems (2022)
  5. Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402 (2023)
  6. Brooks, T., et al.: Video generation models as world simulators (2024). https://openai.com/research/video-generation-models-as-world-simulators
  7. Chang, Y., Narang, M., Suzuki, H., Cao, G., Gao, J., Bisk, Y.: WebQA: multihop and multimodal QA. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16495–16504 (2022)
  8. Changpinyo, S., Pont-Tuset, J., Ferrari, V., Soricut, R.: Telling the what while pointing to the where: multimodal queries for image retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12136–12146 (2021)
  9. Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-Imagen: retrieval-augmented text-to-image generator. In: The Eleventh International Conference on Learning Representations (2023)
  10. Chen, Y., et al.: Can pre-trained vision and language models answer visual information-seeking questions? In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)
  11. Chen, Y.C., et al.: UNITER: learning universal image-text representations (2019)
  12. Chowdhery, A., et al.: PaLM: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)
  13. Chung, H.W., et al.: Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022)
  14. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: Advances in Neural Information Processing Systems (2023)
  15. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: The International Conference on Learning Representations (2021)
  16. Fu, S., et al.: DreamSim: learning new dimensions of human visual similarity using synthetic data. In: Advances in Neural Information Processing Systems (2023)
  17. Ge, Y., et al.: Making LLaMA see and draw with SEED tokenizer. arXiv preprint arXiv:2310.01218 (2023)
  18. Girdhar, R., et al.: ImageBind: one embedding space to bind them all. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190 (2023)
  19. Han, X., et al.: Automatic spatially-aware fashion concept discovery. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
  20. Hu, H., et al.: Instruct-Imagen: image generation with multi-modal instruction. arXiv preprint arXiv:2401.01952 (2024)
  21. Hu, H., et al.: Open-domain visual entity recognition: towards recognizing millions of Wikipedia entities. In: Proceedings of the IEEE International Conference on Computer Vision (2023)
  22. Jain, A., et al.: MURAL: multimodal, multitask retrieval across languages. In: Findings of the Association for Computational Linguistics: EMNLP 2021 (2021)
  23. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916. PMLR (2021)
  24. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(3), 535–547 (2019)
  25. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
  26. Kim, W., Son, B., Kim, I.: ViLT: vision-and-language transformer without convolution or region supervision. In: International Conference on Machine Learning, pp. 5583–5594. PMLR (2021)
  27. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International Conference on Machine Learning (2023)
  28. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning, pp. 12888–12900. PMLR (2022)
  29. Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 34, 9694–9705 (2021)
  30. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
  31. Lin, X.V., et al.: RA-DIT: retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352 (2023)
  32. Liu, F., Wang, Y., Wang, T., Ordonez, V.: Visual news: benchmark and challenges in news image captioning. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2021)
  33. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (2023)
  34. Liu, S., Feng, W., Chen, W., Wang, W.Y.: EDIS: entity-driven image search over multimodal web content. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2023)
  35. Liu, Z., Xiong, C., Lv, Y., Liu, Z., Yu, G.: Universal vision-language dense retrieval: learning a unified representation space for multi-modal retrieval. In: The Eleventh International Conference on Learning Representations (2023)
  36. Liu, Z., Rodriguez-Opazo, C., Teney, D., Gould, S.: Image retrieval on real-life images with pre-trained vision-and-language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2125–2134 (2021)
  37. Luo, M., Fang, Z., Gokhale, T., Yang, Y., Baral, C.: End-to-end knowledge retrieval with multi-modal queries. In: Annual Meeting of the Association for Computational Linguistics (2023)
  38. Mishra, S., Khashabi, D., Baral, C., Hajishirzi, H.: Cross-task generalization via natural language crowdsourcing instructions. In: Annual Meeting of the Association for Computational Linguistics (2022)
  39. OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023). https://api.semanticscholar.org/CorpusID:257532815
  40. Ouyang, L., et al.: Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022)
  41. Pan, X., Dong, L., Huang, S., Peng, Z., Chen, W., Wei, F.: Kosmos-G: generating images in context with multimodal large language models. arXiv preprint arXiv:2310.02992 (2023)
  42. Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649 (2015)
  43. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  44. Ram, O., et al.: In-context retrieval-augmented language models. Trans. Assoc. Comput. Linguist. 11, 1316–1331 (2023)
  45. Sheynin, S., et al.: kNN-Diffusion: image generation via large-scale retrieval. In: The Eleventh International Conference on Learning Representations (2023)
  46. Singhal, A., et al.: Modern information retrieval: a brief overview. IEEE Data Eng. Bull. 24(4), 35–43 (2001)
  47. Su, H., et al.: One embedder, any task: instruction-finetuned text embeddings. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Findings of the Association for Computational Linguistics: ACL 2023 (2023)
  48. Sun, Q., et al.: Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286 (2023)
  49. Tang, Z., Yang, Z., Khademi, M., Liu, Y., Zhu, C., Bansal, M.: CoDi-2: in-context, interleaved, and interactive any-to-any generation. arXiv preprint arXiv:2311.18775 (2023)
  50. Tang, Z., Yang, Z., Zhu, C., Zeng, M., Bansal, M.: Any-to-any generation via composable diffusion. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
  51. Gemini Team, et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
  52. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., Gurevych, I.: BEIR: a heterogeneous benchmark for zero-shot evaluation of information retrieval models. In: Advances in Neural Information Processing Systems (2021)
  53. Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  54. Wang, B., et al.: InstructRetro: instruction tuning post retrieval-augmented pretraining. arXiv preprint arXiv:2310.07713 (2023)
  55. Wei, J., et al.: Finetuned language models are zero-shot learners. In: The International Conference on Learning Representations (2022)
  56. Wu, H., et al.: Fashion IQ: a new dataset towards retrieving images by natural language feedback. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11307–11317 (2021)
  57. Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: NExT-GPT: any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519 (2023)
  58. Xu, Z., Shen, Y., Huang, L.: MultiInstruct: improving multi-modal zero-shot learning via instruction tuning. In: Annual Meeting of the Association for Computational Linguistics (2023)
  59. Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.C., Liu, Z., Wang, L.: The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421 (2023)
  60. Yasunaga, M., et al.: Retrieval-augmented multimodal language modeling. In: International Conference on Machine Learning (2023)
  61. Yu, L., et al.: Scaling autoregressive multi-modal models: pretraining and instruction tuning. arXiv preprint arXiv:2309.02591 (2023)
  62. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE International Conference on Computer Vision (2023)
  63. Zhang, K., Mo, L., Chen, W., Sun, H., Su, Y.: MagicBrush: a manually annotated dataset for instruction-guided image editing. In: Advances in Neural Information Processing Systems (2023)
