Integrating multimodal features by a two-way co-attention mechanism for visual question answering
References
Zhang C, Lu Y (2021) Study on artificial intelligence: the state of the art and future prospects. J Ind Inf Integr 23:100224
Sharma H, Srivastava S (2021) Visual question-answering model based on the fusion of multimodal features by a two-way co-attention mechanism. Imaging Sci J 69(1–4):177–189
Bhatt D, Patel C, Talsania H, Patel J, Vaghela R, Pandya S, ..., Ghayvat H (2021) CNN variants for computer vision: history, architecture, application, challenges, and future scope. Electronics 10(20):2470
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, Montreal Convention Center, Montreal, Canada, December 7–10
Schwartz I, Schwing A, Hazan T (2017) High-order attention models for visual question answering. Advances in Neural Information Processing Systems, 30. Long Beach, California, USA, December 4–9, 3667–3677
Wu Y, Ma Y, Wan S (2021) Multi-scale relation reasoning for multi-modal visual question answering. Signal Process: Image Commun 96:116319
Zhang S, Chen M, Chen J, Zou F, Li YF, Lu P (2021) Multimodal feature-wise co-attention method for visual question answering. Information Fusion 73:1–10
Dey R, Salem FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS). IEEE, pp 1597–1600
Zhan H, Xiong P, Wang X, Wang X, Yang L (2022) Visual question answering by pattern matching and reasoning. Neurocomputing 467:323–336
Zheng W, Yin L, Chen X, Ma Z, Liu S, Yang B (2021) Knowledge base graph embedding module design for visual question answering model. Pattern Recogn 120:108153
Guo W, Zhang Y, Yang J, Yuan X (2021) Re-attention for visual question answering. IEEE Trans Image Process 30:6730–6743
Zheng X, Wang B, Du X, Lu X (2021) Mutual attention inception network for remote sensing visual question answering. IEEE Trans Geosci Remote Sens 60:1–14
Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, Nevada, June 26-July 1, 21–29
Ilievski I, Yan S, Feng J (2016) A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485
Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical question-image co-attention for visual question answering. Advances in neural information processing systems, 29, Barcelona, Spain, December 5–10, 289–297
Zhu X, Mao Z, Chen Z, Li Y, Wang Z, Wang B (2021) Object-difference drived graph convolutional networks for visual question answering. Multimed Tools Appl 80(11):16247–16265
Gao L, Cao L, Xu X, Shao J, Song J (2020) Question-led object attention for visual question answering. Neurocomputing 391:227–233
Singh A, Natarajan V, Shah M, Jiang Y, Chen X, Batra D, ..., Rohrbach M (2019) Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, June 16–20, 8317–8326
Cadene R, Ben-Younes H, Cord M, Thome N (2019) Murel: multimodal relational reasoning for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, June 16–20, 1989–1998
Zhang W, Yu J, Hu H, Hu H, Qin Z (2020) Multimodal feature fusion by relational reasoning and attention for visual question answering. Information Fusion 55:116–126
Sharma H, Jalal AS (2022) An improved attention and hybrid optimization technique for visual question answering. Neural Process Lett 54(1):709–730
Zhang X, Wu C, Zhao Z, Lin W, Zhang Y, Wang Y, Xie W (2023) PMC-VQA: visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415
Jiang H, Misra I, Rohrbach M, Learned-Miller E, Chen X (2020) In defense of grid features for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 10267–10276
Chen L, Yan X, Xiao J, Zhang H, Pu S, Zhuang Y (2020) Counterfactual samples synthesizing for robust visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 10800–10809
Sharma H, Jalal AS (2022) Image captioning improved visual question answering. Multimed Tools Appl 81(24):34775–34796
Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 6281–6290
Li L, Gan Z, Cheng Y, Liu J (2019) Relation-aware graph attention network for visual question answering. In: Proceedings of the IEEE/CVF international conference on computer vision. pp 10313–10322
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, ..., Fei-Fei L (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73
Teney D, Anderson P, He X, Van Den Hengel A (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, Utah, June 18–22, 4223–4232
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on Empirical Methods in Natural Language Processing (EMNLP), association for computational linguistics, Doha, Qatar. 1532–1543
Yang Z, He X, Gao J, Deng L, Smola A (2016) Stacked attention networks for image question answering. In: CVPR, pp 21–29
Lu J, Yang J, Batra D, Parikh D (2016) Hierarchical question-image co-attention for visual question answering. In: NIPS. pp 289–297
Kazemi V, Elqursh A (2017) Show, ask, attend, and answer: a strong baseline for visual question answering. arXiv preprint arXiv:1704.03162v2
Nguyen D, Okatani T (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: CVPR. pp 6087–6096
Ramanan D, Pirsiavash H, Fowlkes C (2009) Bilinear classifiers for visual recognition. In: NIPS, pp 1482–1490
Yu Z, Yu J, Xiang C, Fan J, Tao D (2018) Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering. IEEE Trans Neural Netw Learn Syst 29(12):5947–5959
Kim JH, On KW, Lim W, Kim J, Ha JW, Zhang BT (2017) Hadamard product for low-rank bilinear pooling. In: ICLR
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. In: Proc. IEEE Int. Conf. Computer Vision (ICCV). pp 2425–2433. https://doi.org/10.1109/ICCV.2015.279
Goyal Y, Khot T, Summers-Stay D, Batra D, Parikh D (2017) Making the v in VQA matter: elevating the role of image understanding in visual question answering. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR). pp 6325–6334. https://doi.org/10.1109/CVPR.2017.670
Lin T, Maire M, Belongie SJ, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Fleet DJ, Pajdla T, Schiele B, Tuytelaars T (eds) Computer vision - ECCV 2014 - 13th European conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, in: Lecture Notes in Computer Science. Springer, vol. 8693, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Hariharan B, Johnson J, Maaten L, Li F-F (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp 1988–1997
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Peng L, Yang Y, Wang Z, Huang Z, Shen HT (2020) Mra-net: improving vqa via multi-modal relation attention network. IEEE Trans Pattern Anal Mach Intell 44(1):318–329
Zhou Y, Ren T, Zhu C, Sun X, Liu J, Ding X, ..., Ji R (2021) TRAR: routing the attention spans in transformer for visual question answering. In: Proceedings of the IEEE/CVF international conference on computer vision, June 19–25, 2074–2084
Nam H, Ha JW, Kim J (2017) Dual attention networks for multimodal reasoning and matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, Hawaii, July 21–26, 299–307
Kazemi V, Elqursh A (2017) Show, ask, attend, and answer: a strong baseline for visual question answering. arXiv preprint arXiv:1704.03162
Yu D, Fu J, Mei T, Rui Y (2017) Multi-level attention networks for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, Hawaii, July 21–26, 4709–4717
Wang P, Wu Q, Shen C, van den Hengel A (2017) The vqa-machine: learning how to use existing vision algorithms to answer new questions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Honolulu, Hawaii, July 21–26, 1173–1182
Nguyen DK, Okatani T (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, Utah, June 18–22, 6087–6096
Yu D, Fu J, Tian X, Mei T (2019) Multi-source multi-level attention networks for visual question answering. ACM Trans Multimed Comput Commun Appl (TOMM) 15(2s):1–20
Liu Y, Zhang X, Zhao Z, Zhang B, Cheng L, Li Z (2020) ALSA: adversarial learning of supervised attentions for visual question answering. IEEE Trans Cybern 52(6):4520–4533
Liu Y, Zhang X, Huang F, Zhou Z, Zhao Z, Li Z (2020) Visual question answering via combining inferential attention and semantic space mapping. Knowl-Based Syst 207:106339
Peng L, Yang Y, Zhang X, Ji Y, Lu H, Shen HT (2020) Answer again: improving VQA with cascaded-answering model. IEEE Trans Knowl Data Eng 34(4):1644–1655
Li W, Sun J, Liu G, Zhao L, Fang X (2020) Visual question answering with attention transfer and a cross-modal gating mechanism. Pattern Recogn Lett 133:334–340
Sharma H, Jalal AS (2021) Visual question answering model based on graph neural network and contextual attention. Image Vis Comput 110:104165
Kim JJ, Lee DG, Wu J, Jung HG, Lee SW (2021) Visual question answering based on local-scene-aware referring expression generation. Neural Netw 139:158–167
Guo D, Xu C, Tao D (2021) Bilinear graph networks for visual question answering. IEEE Trans Neural Netw Learn Syst
Yang X, Gao C, Zhang H, Cai J (2021) Auto-parsing network for image captioning and visual question answering. In: Proceedings of the IEEE/CVF international conference on computer vision, June 19–25, pp 2197–2207
Sharma H, Jalal AS (2022) A framework for visual question answering with the integration of scene-text using PHOCs and fisher vectors. Expert Syst Appl 190:116159
Sharma H, Jalal AS (2022) Improving visual question answering by combining scene-text information. Multimed Tools Appl 81(9):12177–12208
Barra S, Bisogni C, De Marsico M, Ricciardi S (2021) Visual question answering: which investigated applications? Pattern Recogn Lett 151:325–331
Gao H, Xu K, Cao M, Xiao J, Xu Q, Yin Y (2021) The deep features and attention mechanism-based method to dish healthcare under social iot systems: an empirical study with a hand-deep local–global net. IEEE Trans Comput Soc Syst 9(1):336–347
Gao H, Xiao J, Yin Y, Liu T, Shi J (2022) A mutually supervised graph attention network for few-shot segmentation: the perspective of fully utilizing limited samples. IEEE Trans Neural Netw Learn Syst
Xiao J, Xu H, Gao H, Bian M, Li Y (2021) A weakly supervised semantic segmentation network by aggregating seed cues: the multi-object proposal generation perspective. ACM Trans Multimed Comput Commun Appl 17(1s):1–19