A new approach for novel object image captioning based on data key multitask conformer with representative mask
References
Wajid MS, Terashima-Marin H, Najafirad P, Wajid MA (2023) Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods. Engineering Reports 6(1):e12785
Hede P, Moellic P, Bourgeoys J, Joint M, Thomas C (2004) Automatic generation of natural language descriptions for images. S.I.E.A
Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models. In: Proceedings of the 31st international conference on machine learning, Beijing, China
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR)
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd international conference on machine learning, PMLR
You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: IEEE conference on computer vision and pattern recognition (CVPR)
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: IEEE conference on computer vision and pattern recognition (CVPR)
Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Yang X, Peng J, Wang Z, Xu H, Ye Q, Li C, Huang S, Huang F, Li Z, Zhang Y (2023) Transforming visual scene graphs to image captions. In: Proceedings of the 61st annual meeting of the association for computational linguistics
Kojima A, Tamura T, Fukunaga K (2002) Natural language description of human activities from video images based on concept hierarchy of actions. Int J Comput Vision 50:171–184
Bai S, An S (2018) A survey on automatic image caption generation. Neurocomputing 311:291–304
Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth D (2010) Every picture tells a story: generating sentences from images. In: ECCV'10: Proceedings of the 11th European conference on Computer vision
Yang Y, Teo CL, Daumé H III, Aloimonos Y (2011) Corpus-guided sentence generation of natural images. In: Proceedings of the 2011 conference on empirical methods in natural language processing
Ordonez V, Kulkarni G, Berg TL (2011) Describing images using 1 million captioned photographs. In: Neural information processing systems
Gupta A, Verma Y, Jawahar CV (2012) Choosing linguistics over vision to describe images. In: Proceedings of the twenty-sixth AAAI conference on artificial intelligence
Ushiku Y, Harada T, Kuniyoshi Y (2011) Automatic sentence generation from images. In: MM '11: Proceedings of the 19th ACM international conference on multimedia
Mitchell M, Dodge J, Goyal A, Yamaguchi K, Stratos K, Han X, Mensch A, Berg A, Berg T, Daumé H III (2012) Midge: generating image descriptions from computer vision detections. In: EACL '12: Proceedings of the 13th conference of the european chapter of the association for computational linguistics
Goh H, Thome N, Cord M, Lim J (2014) Learning deep hierarchical visual feature coding. IEEE Trans Neural Netw Learn Syst, pp 2212–2225
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35:1798–1828
Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2013) DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st international conference on machine learning
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 2014 ACM conference on multimedia
Zhang N, Ding S, Zhang J, Xue Y (2017) Research on point-wise gated deep networks. Appl Soft Comput 52:1210–1221
Papa JP, Scheirer W, Cox DD (2015) Fine-tuning deep belief networks using harmony search. Appl Soft Comput 46:875–885
Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 35:1915–1929
Ijjina EP, Mohan CK (2016) Hybrid deep neural network model for human action recognition. Appl Soft Comput 46:936–952
Wang S, Jiang Y, Chung F-L, Qian P (2015) Feedforward kernel neural networks, generalized least learning machine and its deep learning with application to image classification. Appl Soft Comput 37:125–141
Bai S (2017) Growing random forest on deep convolutional neural networks for scene categorization. Expert Syst Appl 71:279–287
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR
Cho K, van Merriënboer B, Gulcehre C (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)
Collobert R, Weston J (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS 2013
Fang H, Gupta S, Iandola F, Srivastava R, Deng L, Dollár P, Gao J, He X, Mitchell M, Platt JC, Lawrence Zitnick C, Zweig G (2015) From captions to visual concepts and back. In: IEEE conference on computer vision and pattern recognition (CVPR)
Donahue J, Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 39:677–691
Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A (2014) Deep captioning with multimodal recurrent neural networks. arXiv preprint
Hendricks LA, Venugopalan S, Rohrbach M, Mooney RJ, Saenko K, Darrell T (2015) Deep compositional captioning: describing novel object categories without paired training data. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1–10
Karpathy A, Joulin A, Li F (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS'14: Proceedings of the 27th international conference on neural information processing systems
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Su Q, Hu J, Li Z (2024) Visual contextual relationship augmented transformer for image captioning. Appl Intell 54:4794–4813
Lin TY, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick CL, Dollár P (2014) Microsoft COCO: common objects in context. In: European conference on computer vision
Deng J, Dong W, Socher R, Li L, Li K, Li F (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition
Krasin I, Duerig T, Alldrin N, Veit A, Abu-El-Haija S, Belongie S, Cai D, Feng Z, Ferrari V, Gomes V (2016) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. arXiv:1811.00982
Venugopalan S, Hendricks LA, Rohrbach M, Mooney RJ, Darrell T, Saenko K (2017) Captioning images with diverse objects. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR)
Hu X, Yin X, Lin K, Zhang L, Gao J, Wang L, Liu Z (2021) VIVO: visual vocabulary pre-training for novel object captioning. In: Proceedings of the AAAI conference on artificial intelligence
Wu Y, Jiang L, Yang Y (2023) Switchable novel object captioner. IEEE Trans Pattern Anal Mach Intell 45(1):1162–1173
Demirel B, Cinbis RG, Ikizler-Cinbis N (2019) Image captioning with unseen objects. In: British machine vision conference
Wu Y, Zhu L, Jiang L, Yang Y (2018) Decoupled novel object captioner. In: MM '18: proceedings of the 26th ACM international conference on Multimedia
Wei J, Li Z, Zhang C, Ma H (2024) Mining core information by evaluating semantic importance for unpaired image captioning. Neural Netw 179:106519
Hu X, Yin X, Lin K, Wang L, Zhang L (2021) Visual vocabulary pre-training for novel object captioning. In: Proceedings of the AAAI conference on artificial intelligence
Anderson P, Fernando B, Johnson M, Gould S (2017) Guided open vocabulary image captioning with constrained beam search. In: Proceedings of the 2017 conference on empirical methods in natural language processing, Copenhagen, Denmark
Chen X, Jiang M, Zhao Q (2021) Leveraging human attention in novel object captioning. In: International joint conference on artificial intelligence
Agrawal H, Desai K, Wang Y, Chen X, Jain R, Johnson M, Batra D, Parikh D (2019) Nocaps: novel object captioning at scale. In: IEEE/CVF international conference on computer vision (ICCV)
Wang Y, Wood ID, Wan S, Johnson M (2021) ECOL-R: encouraging copying in novel object captioning with reinforcement learning. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics
Du S, Zhu H, Lin G, Wang D, Shi J (2023) Novel object captioning with semantic match from external knowledge. Appl Sci
Hua P, Sun H, Hao J, Liu C, Wang J, Qi Q, Liao J (2023) Reasoning guided by a manual: context-aware image captioning with novel objects. In: 26th European conference on artificial intelligence ECAI 2023
Zheng H, Wu J, Liang R, Li Y, Li X (2019) Multi-task learning for captioning images with novel words. IET Comput Vis 13:294–301
Lu J, Yang J, Batra D, Parikh D (2018) Neural baby talk. In: 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019
Li LH, Yatskar M, Yin D, Hsieh CJ, Chang KW (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv:1908.03557
Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2019) VL-BERT: pre-training of generic visual-linguistic representations. arXiv:1908.08530
Fariha A (2018) Automatic image captioning using multi-task learning
Huang JT, Li J, Yu D, Deng L, Gong Y (2013) Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In: IEEE international conference on acoustics, speech, and signal processing
Tang Z, Li L, Wang D (2016) Multi-task recurrent model for speech and speaker recognition. In: Asia-Pacific signal and information processing association annual summit and conference (APSIPA), pp 1–4
Vaessen N, van Leeuwen D (2023) Towards multi-task learning of speech and speaker recognition. In: INTERSPEECH 2023
Gulati A, Qin J, Chiu C, Parmar N, Zhang Y, Yu J, Han W, Wang S, Zhang Z, Wu Y, Pang R (2020) Conformer: convolution-augmented transformer for speech recognition. In: INTERSPEECH 2020
Jocher G, Chaurasia A, Qiu J (2023) Ultralytics YOLO
Feng Q, Wu Y, Fan H, Yan C, Yang Y (2020) Cascaded revision network for novel object captioning. IEEE Trans Circuits Syst Video Technol 30:3413–3421
Hua P, Sun H, Hao J, Liu C, Wang J, Qi Q, Liao J (2023) Reasoning guided by a manual: context-aware image captioning with novel objects. In: 26th European conference on artificial intelligence-ECAI 2023
Li X, Yin X, Li C, Zhang P, Hu X, Zhang L, Wang L, Hu H, Dong L, Wei F, Choi Y, Gao J (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In: 16th European conference on computer vision (ECCV)