Text-to-image person retrieval with implicit relation alignment and contrastive learning

References

Li S, Xiao T, Li H, Zhou B, Yue D, Wang X (2017) Person search with natural language description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1970–1979
Zheng Z, Zheng L, Garrett M, Yang Y, Xu M, Shen Y-D (2020) Dual-path convolutional image-text embeddings with instance loss. ACM Trans Multimedia Comput Commun Appl 16(2):1–23
Chen Y, Huang R, Chang H, Tan C, Xue T, Ma B (2021) Cross-modal knowledge adaptation for language-based person search. IEEE Trans Image Process 30:4057–4069
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Sarafianos N, Xu X, Kakadiaris IA (2019) Adversarial representation learning for text-to-image matching. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 5814–5824
Zhang Y, Lu H (2018) Deep cross-modal projection learning for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 686–701
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Chen Y, Zhang G, Lu Y, Wang Z, Zheng Y (2022) Tipcb: a simple but effective part-based convolutional baseline for text-based person search. Neurocomputing 494:171–181
Wang Z, Fang Z, Wang J, Yang Y (2020) Vitaa: visual-textual attributes alignment in person search by natural language. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pp 402–420. Springer
Wang Z, Zhu A, Xue J, Wan X, Liu C, Wang T, Li Y (2022) Caibc: capturing all-round information beyond color for text-based person retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia, pp 5314–5322
Wu Y, Yan Z, Han X, Li G, Zou C, Cui S (2021) Lapscore: language-guided person search via color reasoning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 1624–1633
Ding Z, Ding C, Shao Z, Tao D (2021) Semantically self-aligned network for text-to-image part-aware person re-identification. arXiv preprint arXiv:2107.12666
Farooq A, Awais M, Kittler J, Khalid SS (2022) Axm-net: implicit cross-modal feature alignment for person re-identification. Proc AAAI Conf Artif Intell 36:4477–4485
Shao Z, Zhang X, Fang M, Lin Z, Wang J, Ding C (2022) Learning granularity-unified representations for text-to-image person re-identification. In: Proceedings of the 30th ACM International Conference on Multimedia, pp 5566–5574
Suo W, Sun M, Niu K, Gao Y, Wang P, Zhang Y, Wu Q (2022) A simple and robust correlation filtering method for text-based person search. In: European Conference on Computer Vision, pp 726–742. Springer
Han X, He S, Zhang L, Xiang T (2021) Text-based person search with limited data. arXiv preprint arXiv:2110.10807
Yan S, Dong N, Zhang L, Tang J (2022) Clip-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276
Wang B, Yang Y, Xu X, Hanjalic A, Shen HT (2017) Adversarial cross-modal retrieval. In: Proceedings of the 25th ACM International Conference on Multimedia, pp 154–162
Faghri F, Fleet DJ, Kiros JR, Fidler S (2017) Vse++: improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612
Dutton B (2020) Adversarial canonical correlation analysis. arXiv preprint arXiv:2005.10349
Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539
Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
Ji Z, Wang H, Han J, Pang Y (2019) Saliency-guided attention network for image-sentence matching. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 5754–5763
Lee K-H, Chen X, Hua G, Hu H, He X (2018) Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 201–216
Wang Y, Yang H, Qian X, Ma L, Lu J, Li B, Fan X (2019) Position focused attention network for image-text matching. arXiv preprint arXiv:1907.09748
Liu C, Mao Z, Liu A-A, Zhang T, Wang B, Zhang Y (2019) Focus your attention: a bidirectional focal attention network for image-text matching. In: Proceedings of the 27th ACM International Conference on Multimedia, pp 3–11
Ge X, Chen F, Xu S, Tao F, Jose JM (2023) Cross-modal semantic enhanced interaction for image-sentence retrieval. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 1022–1031
Liu C, Mao Z, Zhang T, Xie H, Wang B, Zhang Y (2020) Graph structured network for image-text matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10921–10930
Li Z, Guo C, Feng Z, Hwang J-N, Xue X (2022) Multi-view visual semantic embedding. In: IJCAI 2:7
Pan Z, Wu F, Zhang B (2023) Fine-grained image-text matching by cross-modal hard aligning network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19275–19284
Li Z, Ling F, Zhang C, Ma H (2020) Combining global and local similarity for cross-media retrieval. IEEE Access 8:21847–21856
Diao H, Zhang Y, Ma L, Lu H (2021) Similarity reasoning and filtration for image-text matching. Proc AAAI Conf Artif Intell 35:1218–1226
Wen K, Gu X, Cheng Q (2020) Learning dual semantic relations with graph attention for image-text matching. IEEE Trans Circuits Syst Video Technol 31(7):2866–2879
Gao P, Geng S, Zhang R, Ma T, Fang R, Zhang Y, Li H, Qiao Y (2024) Clip-adapter: better vision-language models with feature adapters. Int J Comput Vis 132(2):581–595
Wang Q, Chen W-j, Li B, Su J, Wang G, Song Q (2025) Heclip: histology-enhanced contrastive learning for imputation of transcriptomics profiles. arXiv preprint arXiv:2501.14948
Nie Y, He W, Han K, Tang Y, Guo T, Du F, Wang Y (2023) Lightclip: learning multi-level interaction for lightweight vision-language models. arXiv preprint arXiv:2312.00674
Jiang D, Ye M (2023) Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2787–2797
Shu X, Wen W, Wu H, Chen K, Song Y, Qiao R, Ren B, Wang X (2022) See finer, see more: implicit modality alignment for text-based person retrieval. In: European Conference on Computer Vision, pp 624–641. Springer
Zhu A, Wang Z, Li Y, Wan X, Jin J, Wang T, Hu F, Hua G (2021) Dssl: deep surroundings-person separation learning for text-based person retrieval. In: Proceedings of the 29th ACM International Conference on Multimedia, pp 209–217
Wang Z, Zhu A, Xue J, Wan X, Liu C, Wang T, Li Y (2022) Look before you leap: improving text-based person retrieval by learning a consistent cross-modal common manifold. In: Proceedings of the 30th ACM International Conference on Multimedia, pp 1984–1992
Li S, Cao M, Zhang M (2022) Learning semantic-aligned feature representation for text-based person search. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2724–2728. IEEE
Yan S, Tang H, Zhang L, Tang J (2023) Image-specific information suppression and implicit local alignment for text-based person search. IEEE Transactions on Neural Networks and Learning Systems
