A new approach for novel object image captioning based on data key multitask conformer with representative mask

References

  1. Wajid MS, Terashima-Marin H, Najafirad P, Wajid MA (2023) Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods. Engineering Reports 6(1):e12785
  2. Hede P, Moellic P, Bourgeoys J, Joint M, Thomas C (2004) Automatic generation of natural language descriptions for images. S.I.E.A
  3. Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models. In: Proceedings of the 31st international conference on machine learning, Beijing, China
  4. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR)
  5. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd international conference on machine learning, PMLR
  6. You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: IEEE conference on computer vision and pattern recognition (CVPR)
  7. Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: IEEE conference on computer vision and pattern recognition (CVPR)
  8. Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
  9. Yang X, Peng J, Wang Z, Xu H, Ye Q, Li C, Huang S, Huang F, Li Z, Zhang Y (2023) Transforming visual scene graphs to image captions. In: Proceedings of the 61st annual meeting of the association for computational linguistics
  10. Kojima A, Tamura T, Fukunaga K (2002) Natural language description of human activities from video images based on concept hierarchy of actions. Int J Comput Vision 50:171–184
  11. Bai S, An S (2018) A survey on automatic image caption generation. Neurocomputing 311:291–304
  12. Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth D (2010) Every picture tells a story: generating sentences from images. In: ECCV'10: Proceedings of the 11th European conference on Computer vision
  13. Yang Y, Teo CL, Daumé H, Aloimonos Y (2011) Corpus-guided sentence generation of natural images. In: Proceedings of the 2011 conference on empirical methods in natural language processing
  14. Ordonez V, Kulkarni G, Berg TL (2011) Describing images using 1 million captioned photographs. In: Neural information processing systems
  15. Gupta A, Verma Y, Jawahar CV (2012) Choosing linguistics over vision to describe images. In: Proceedings of the twenty-sixth AAAI conference on artificial intelligence
  16. Ushiku Y, Harada T, Kuniyoshi Y (2011) Automatic sentence generation from images. In: MM '11: Proceedings of the 19th ACM international conference on multimedia
  17. Mitchell M, Dodge J, Goyal A, Yamaguchi K, Stratos K, Han X, Mensch A, Berg A, Berg T, Daumé H (2012) Midge: generating image descriptions from computer vision detections. In: EACL '12: Proceedings of the 13th conference of the European chapter of the association for computational linguistics
  18. Goh H, Thome N, Cord M, Lim J (2014) Learning deep hierarchical visual feature coding. IEEE Trans Neural Netw Learn Syst, pp 2212–2225
  19. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35:1798–1828
  20. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2013) DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st international conference on machine learning
  21. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 2014 ACM conference on multimedia
  22. Zhang N, Ding S, Zhang J, Xue Y (2017) Research on point-wise gated deep networks. Appl Soft Comput 52:1210–1221
  23. Papa JP, Scheirer W, Cox DD (2015) Fine-tuning deep belief networks using harmony search. Appl Soft Comput 46:875–885
  24. Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 35:1915–1929
  25. Ijjina EP, Mohan CK (2016) Hybrid deep neural network model for human action recognition. Appl Soft Comput 46:936–952
  26. Wang S, Jiang Y, Chung F-L, Qian P (2015) Feedforward kernel neural networks, generalized least learning machine and its deep learning with application to image classification. Appl Soft Comput 37:125–141
  27. Bai S (2017) Growing random forest on deep convolutional neural networks for scene categorization. Expert Syst Appl 71:279–287
  28. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR
  29. Cho K, van Merriënboer B, Gulcehre C (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)
  30. Collobert R, Weston J (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning
  31. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS 2013
  32. Fang H, Gupta S, Iandola F, Srivastava R, Deng L, Dollár P, Gao J, He X, Mitchell M, Platt JC, Lawrence Zitnick C, Zweig G (2015) From captions to visual concepts and back. In: IEEE conference on computer vision and pattern recognition (CVPR)
  33. Donahue J, Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S (2017) Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 39:677–691
  34. Mao J, Xu W, Yang Y, Wang J, Huang Z, Yuille A (2014) Deep captioning with multimodal recurrent neural networks. arXiv preprint
  35. Hendricks LA, Venugopalan S, Rohrbach M, Mooney RJ, Saenko K, Darrell T (2015) Deep compositional captioning: describing novel object categories without paired training data. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1–10
  36. Karpathy A, Joulin A, Li F (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: NIPS'14: Proceedings of the 27th international conference on neural information processing systems
  37. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  38. Su Q, Hu J, Li Z (2024) Visual contextual relationship augmented transformer for image captioning. Appl Intell 54:4794–4813
  39. Lin TY, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick CL, Dollár P (2014) Microsoft COCO: common objects in context. In: European conference on computer vision
  40. Deng J, Dong W, Socher R, Li L, Li K, Li F (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition
  41. Krasin I, Duerig T, Alldrin N, Veit A, Abu-El-Haija S, Belongie S, Cai D, Feng Z, Ferrari V, Gomes V (2016) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. arXiv:1811.00982
  42. Venugopalan S, Hendricks LA, Rohrbach M, Mooney RJ, Darrell T, Saenko K (2017) Captioning images with diverse objects. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR)
  43. Hu X, Yin X, Lin K, Zhang L, Gao J, Wang L, Liu Z (2021) VIVO: visual vocabulary pre-training for novel object captioning. In: Proceedings of the AAAI conference on artificial intelligence
  44. Wu Y, Jiang L, Yang Y (2023) Switchable novel object captioner. IEEE Trans Pattern Anal Mach Intell 45(1):1162–1173
  45. Demirel B, Cinbis RG, Ikizler-Cinbis N (2019) Image captioning with unseen objects. In: British machine vision conference
  46. Wu Y, Zhu L, Jiang L, Yang Y (2018) Decoupled novel object captioner. In: MM '18: proceedings of the 26th ACM international conference on Multimedia
  47. Wei J, Li Z, Zhang C, Ma H (2024) Mining core information by evaluating semantic importance for unpaired image captioning. Neural Netw 179:106519
  48. Hu X, Yin X, Lin K, Wang L, Zhang L (2021) Visual vocabulary pre-training for novel object captioning. In: Proceedings of the AAAI conference on artificial intelligence
  49. Anderson P, Fernando B, Johnson M, Gould S (2017) Guided open vocabulary image captioning with constrained beam search. In: Proceedings of the 2017 conference on empirical methods in natural language processing, Copenhagen, Denmark
  50. Chen X, Jiang M, Zhao Q (2021) Leveraging human attention in novel object captioning. In: International joint conference on artificial intelligence
  51. Agrawal H, Desai K, Wang Y, Chen X, Jain R, Johnson M, Batra D, Parikh D (2019) Nocaps: novel object captioning at scale. In: IEEE/CVF international conference on computer vision (ICCV)
  52. Wang Y, Wood ID, Wan S, Johnson M (2021) ECOL-R: encouraging copying in novel object captioning with reinforcement learning. In: Proceedings of the 16th conference of the european chapter of the association for computational linguistics
  53. Du S, Zhu H, Lin G, Wang D, Shi J (2023) Novel object captioning with semantic match from external knowledge. Appl Sci
  54. Hua P, Sun H, Hao J, Liu C, Wang J, Qi Q, Liao J (2023) Reasoning guided by a manual: context-aware image captioning with novel objects. In: 26th European conference on artificial intelligence (ECAI 2023)
  55. Zheng H, Wu J, Liang R, Li Y, Li X (2019) Multi-task learning for captioning images with novel words. IET Comput Vis 13:294–301
  56. Lu J, Yang J, Batra D, Parikh D (2018) Neural baby talk. In: 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR)
  57. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems
  58. Huy DQ (2020) Multi-task learning: some things you should know. Viblo. [Online]. Available: https://viblo.asia/p/multi-task-learning-mot-so-dieu-ban-nen-biet-3P0lPD08lox
  59. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019
  60. Li LH, Yatskar M, Yin D, Hsieh CJ, Chang KW (2019) VisualBERT: a simple and performant baseline for vision and language. arXiv:1908.03557
  61. Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2019) VL-BERT: pre-training of generic visual-linguistic representations. arXiv:1908.08530
  62. Fariha A (2018) Automatic image captioning using multi-task learning
  63. Huang JT, Li J, Yu D, Deng L, Gong Y (2013) Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In: IEEE international conference on acoustics, speech, and signal processing
  64. Tang Z, Li L, Wang D (2016) Multi-task recurrent model for speech and speaker recognition. In: Asia-Pacific signal and information processing association annual summit and conference (APSIPA), pp 1–4
  65. Vaessen N, van Leeuwen D (2023) Towards multi-task learning of speech and speaker recognition. In: INTERSPEECH 2023
  66. Gulati A, Qin J, Chiu C, Parmar N, Zhang Y, Yu J, Han W, Wang S, Zhang Z, Wu Y, Pang R (2020) Conformer: convolution-augmented transformer for speech recognition. In: INTERSPEECH 2020
  67. Jocher G, Chaurasia A, Qiu J (2023) Ultralytics YOLO
  68. Feng Q, Wu Y, Fan H, Yan C, Yang Y (2020) Cascaded revision network for novel object captioning. IEEE Trans Circuits Syst Video Technol 30:3413–3421
  69. Hua P, Sun H, Hao J, Liu C, Wang J, Qi Q, Liao J (2023) Reasoning guided by a manual: context-aware image captioning with novel objects. In: 26th European conference on artificial intelligence-ECAI 2023
  70. Li X, Yin X, Li C, Zhang P, Hu X, Zhang L, Wang L, Hu H, Dong L, Wei F, Choi Y, Gao J (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In: 16th European Conference