Multi-proposal collaboration and multi-task training for weakly-supervised video moment retrieval
References
Collins RT, Lipton AJ, Kanade T, Fujiyoshi H, Duggins D, Tsin Y, Tolliver D, Enomoto N, Hasegawa O, Burt P, Wixson L (2000) A system for video surveillance and monitoring. VSAM Final Rep 2000:1–68
He Q, Shi R, Chen L, Huo L (2024) Video anomaly detection based on multi-scale optical flow spatio-temporal enhancement and normality mining. Int J Mach Learn Cybern 15:1–16
Kemp CC, Edsinger A, Torres-Jara E (2007) Challenges for robot manipulation in human environments [grand challenges of robotics]. IEEE Robot Autom Mag 14:20–29
Anne Hendricks L, Wang O, Shechtman E, Sivic J, Darrell T, Russell B (2017) Localizing moments in video with natural language. In: Proceedings of the 16th IEEE international conference on computer vision. IEEE, Venice, Italy, pp 5803–5812
Gao J, Sun C, Yang Z, Nevatia R (2017) TALL: temporal activity localization via language query. In: Proceedings of the 16th IEEE international conference on computer vision. IEEE, Venice, Italy, pp 5267–5275
Zhang B, Jiang B, Yang C, Pang L (2022) Dual-channel localization networks for moment retrieval with natural language. In: Proceedings of the 2022 international conference on multimedia retrieval. ACM, Newark, NJ, USA, pp 351–359
Zhang B, Yang C, Jiang B, Zhou X (2022) Video moment retrieval with hierarchical contrastive learning. In: Proceedings of the 30th ACM international conference on multimedia. ACM, Lisbon, Portugal, pp 346–355
Liu M, Wang X, Nie L, He X, Chen B, Chua T-S (2018) Attentive moment retrieval in videos. In: The 41st international ACM SIGIR conference on research & development in information retrieval. ACM, Ann Arbor, MI, USA, pp 15–24
Wang Y, Liu M, Wei Y, Cheng Z, Wang Y, Nie L (2022) Siamese alignment network for weakly supervised video moment retrieval. IEEE Trans Multimed 25:3921–3933
Yoon S, Koo G, Kim D, Yoo CD (2023) SCANet: scene complexity aware network for weakly-supervised video moment retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE/CVF, Paris, France, pp 13576–13586
Huang Y, Yang L, Sato Y (2023) Weakly supervised temporal sentence grounding with uncertainty-guided self-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE/CVF, Vancouver, Canada, pp 18908–18918
Lv Z, Su B, Wen J-R (2023) Counterfactual cross-modality reasoning for weakly supervised video moment localization. In: Proceedings of the 31st ACM international conference on multimedia. ACM, Ottawa, Canada, pp 6539–6547
Mithun NC, Paul S, Roy-Chowdhury AK (2019) Weakly supervised video moment retrieval from text queries. In: Proceedings of the 2019 IEEE/CVF conference on computer vision and pattern recognition. IEEE/CVF, Long Beach, CA, USA, pp 11592–11601
Tan R, Xu H, Saenko K, Plummer BA (2021) LoGAN: latent graph co-attention network for weakly-supervised video moment retrieval. In: Proceedings of the 2021 IEEE/CVF winter conference on applications of computer vision. IEEE/CVF, Waikoloa, HI, USA, pp 2083–2092
Huang J, Liu Y, Gong S, Jin H (2021) Cross-sentence temporal and semantic relations in video activity localisation. In: Proceedings of the 18th IEEE/CVF international conference on computer vision. IEEE/CVF, Montreal, Canada, pp 7199–7208
Yang W, Zhang T, Zhang Y, Wu F (2021) Local correspondence network for weakly supervised temporal sentence grounding. IEEE Trans Image Process 30:3252–3262
Duan X, Huang W, Gan C, Wang J, Zhu W, Huang J (2018) Weakly supervised dense event captioning in videos. Adv Neural Inf Process Syst 31:1–11
Lin Z, Zhao Z, Zhang Z, Wang Q, Liu H (2020) Weakly-supervised video moment retrieval via semantic completion network. In: Proceedings of the AAAI conference on artificial intelligence. AAAI Press, New York, NY, USA, pp 11539–11546
Chen S, Jiang Y-G (2021) Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. In: Proceedings of the 2021 IEEE/CVF conference on computer vision and pattern recognition. IEEE/CVF, Nashville, TN, USA, pp 8425–8435
Zheng M, Huang Y, Chen Q, Liu Y (2022) Weakly supervised video moment localization with contrastive negative sample mining. In: Proceedings of the AAAI conference on artificial intelligence. AAAI Press, Palo Alto, CA, USA, pp 3517–3525
Zhang H, Sun A, Jing W, Zhou JT (2023) Temporal sentence grounding in videos: a survey and future directions. IEEE Trans Pattern Anal Mach Intell 45:10443–10465
Liu M, Nie L, Wang Y, Wang M, Rui Y (2023) A survey on video moment localization. ACM Comput Surv 55:1–37
Gao M, Davis LS, Socher R, Xiong C (2019) WSLLN: weakly supervised natural language localization networks. Computing Research Repository arXiv Preprint, arXiv:1909.00239
Ma M, Yoon S, Kim J, Lee Y, Kang S, Yoo CD (2020) VLANet: Video-language alignment network for weakly-supervised video moment retrieval. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII. Springer, Glasgow, UK, pp 156–171
Chen Z, Ma L, Luo W, Tang P, Wong K-YK (2020) Look closer to ground better: weakly-supervised temporal grounding of sentence in video. Computing Research Repository arXiv Preprint, arXiv:2001.09308
Wang Y, Deng J, Zhou W, Li H (2022) Weakly supervised temporal adjacent network for language grounding. IEEE Trans Multimed 24:3276–3286
Song Y, Wang J, Ma L, Yu Z, Yu J (2020) Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. Computing Research Repository arXiv Preprint, arXiv:2003.07048
Zhang Z, Lin Z, Zhao Z, Zhu J, He X (2020) Regularized two-branch proposal networks for weakly-supervised moment retrieval in videos. In: Proceedings of the 28th ACM international conference on multimedia. ACM, Seattle, WA, USA, pp 4098–4106
Nam J, Ahn D, Kang D, Ha SJ, Choi J (2021) Zero-shot natural language video localization. In: Proceedings of the 18th IEEE/CVF international conference on computer vision. IEEE/CVF, Montreal, Canada, pp 1470–1479
Gao J, Xu C (2021) Learning video moment retrieval without a single annotated video. IEEE Trans Circuits Syst Video Technol 32:1646–1657
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing. Association for Computational Linguistics, Doha, Qatar, pp 1532–1543
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the 38th international conference on machine learning. pp 8748–8763
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the 2017 IEEE conference on computer vision and pattern recognition. IEEE, Honolulu, HI, USA, pp 6299–6308
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:1–11
Bahdanau D, Cho KH, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd international conference on learning representations. ICLR, San Diego, CA, USA, pp 1–10
Zheng M, Huang Y, Chen Q, Peng Y, Liu Y (2022) Weakly supervised temporal sentence grounding with Gaussian-based contrastive proposal learning. In: Proceedings of the 2022 IEEE/CVF conference on computer vision and pattern recognition. pp 15555–15564
Krishna R, Hata K, Ren F, Fei-Fei L, Carlos Niebles J (2017) Dense-captioning events in videos. In: Proceedings of the 16th IEEE international conference on computer vision. IEEE, Venice, Italy, pp 706–715
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. Computing Research Repository arXiv Preprint, arXiv:1412.6980
Wu H, Lyu Y, Shen X, Zhao X, Wang M, Zhang X, Luo Z (2023) Atomic-action-based contrastive network for weakly supervised temporal language grounding. In: 2023 IEEE international conference on multimedia and expo (ICME). pp 1523–1528
Song Y, Wang J, Ma L, Yu J, Liang J, Yuan L, Yu Z (2023) MARN: multi-level attentional reconstruction networks for weakly supervised video temporal grounding. Neurocomputing 554:126625