Multi-proposal collaboration and multi-task training for weakly-supervised video moment retrieval

References

  1. Collins RT, Lipton AJ, Kanade T, Fujiyoshi H, Duggins D, Tsin Y, Tolliver D, Enomoto N, Hasegawa O, Burt P, Wixson L (2000) A system for video surveillance and monitoring. VSAM Final Rep 2000:1–68
  2. He Q, Shi R, Chen L, Huo L (2024) Video anomaly detection based on multi-scale optical flow spatio-temporal enhancement and normality mining. Int J Mach Learn Cybern 15:1–16
  3. Kemp CC, Edsinger A, Torres-Jara E (2007) Challenges for robot manipulation in human environments [grand challenges of robotics]. IEEE Robot Autom Mag 14:20–29
  4. Anne Hendricks L, Wang O, Shechtman E, Sivic J, Darrell T, Russell B (2017) Localizing moments in video with natural language. In: Proceedings of the 16th IEEE international conference on computer vision. IEEE, Venice, Italy, pp 5803–5812
  5. Gao J, Sun C, Yang Z, Nevatia R (2017) Tall: temporal activity localization via language query. In: Proceedings of the 16th IEEE international conference on computer vision. IEEE, Venice, Italy, pp 5267–5275
  6. Zhang B, Jiang B, Yang C, Pang L (2022) Dual-channel localization networks for moment retrieval with natural language. In: Proceedings of the 2022 international conference on multimedia retrieval. ACM, Newark, NJ, USA, pp 351–359
  7. Zhang B, Yang C, Jiang B, Zhou X (2022) Video moment retrieval with hierarchical contrastive learning. In: Proceedings of the 30th ACM international conference on multimedia. ACM, Lisbon, Portugal, pp 346–355
  8. Liu M, Wang X, Nie L, He X, Chen B, Chua T-S (2018) Attentive moment retrieval in videos. In: The 41st international ACM SIGIR conference on research & development in information retrieval. ACM, Ann Arbor, MI, USA, pp 15–24
  9. Wang Y, Liu M, Wei Y, Cheng Z, Wang Y, Nie L (2022) Siamese alignment network for weakly supervised video moment retrieval. IEEE Trans Multimed 25:3921–3933
  10. Yoon S, Koo G, Kim D, Yoo CD (2023) Scanet: scene complexity aware network for weakly-supervised video moment retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE/CVF, Paris, France, pp 13576–13586
  11. Huang Y, Yang L, Sato Y (2023) Weakly supervised temporal sentence grounding with uncertainty-guided self-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE/CVF, Vancouver, Canada, pp 18908–18918
  12. Lv Z, Su B, Wen J-R (2023) Counterfactual cross-modality reasoning for weakly supervised video moment localization. In: Proceedings of the 31st ACM international conference on multimedia. ACM, Ottawa, Canada, pp 6539–6547
  13. Mithun NC, Paul S, Roy-Chowdhury AK (2019) Weakly supervised video moment retrieval from text queries. In: Proceedings of the 2019 IEEE/CVF conference on computer vision and pattern recognition. IEEE/CVF, Long Beach, CA, USA, pp 11592–11601
  14. Tan R, Xu H, Saenko K, Plummer BA (2021) Logan: latent graph co-attention network for weakly-supervised video moment retrieval. In: Proceedings of the 2021 IEEE/CVF winter conference on applications of computer vision. IEEE/CVF, Waikoloa, HI, USA, pp 2083–2092
  15. Huang J, Liu Y, Gong S, Jin H (2021) Cross-sentence temporal and semantic relations in video activity localisation. In: Proceedings of the 18th IEEE/CVF international conference on computer vision. IEEE/CVF, Montreal, Canada, pp 7199–7208
  16. Yang W, Zhang T, Zhang Y, Wu F (2021) Local correspondence network for weakly supervised temporal sentence grounding. IEEE Trans Image Process 30:3252–3262
  17. Duan X, Huang W, Gan C, Wang J, Zhu W, Huang J (2018) Weakly supervised dense event captioning in videos. Adv Neural Inf Process Syst 31:1–11
  18. Lin Z, Zhao Z, Zhang Z, Wang Q, Liu H (2020) Weakly-supervised video moment retrieval via semantic completion network. In: Proceedings of the AAAI conference on artificial intelligence. AAAI Press, New York, NY, USA, pp 11539–11546
  19. Chen S, Jiang Y-G (2021) Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. In: Proceedings of the 2021 IEEE/CVF conference on computer vision and pattern recognition. IEEE/CVF, Nashville, TN, USA, pp 8425–8435
  20. Zheng M, Huang Y, Chen Q, Liu Y (2022) Weakly supervised video moment localization with contrastive negative sample mining. In: Proceedings of the AAAI conference on artificial intelligence. AAAI Press, Palo Alto, CA, USA, pp 3517–3525
  21. Zhang H, Sun A, Jing W, Zhou JT (2023) Temporal sentence grounding in videos: a survey and future directions. IEEE Trans Pattern Anal Mach Intell 45:10443–10465
  22. Liu M, Nie L, Wang Y, Wang M, Rui Y (2023) A survey on video moment localization. ACM Comput Surv 55:1–37
  23. Gao M, Davis LS, Socher R, Xiong C (2019) Wslln: weakly supervised natural language localization networks. Computing Research Repository arXiv Preprint, arXiv:1909.00239
  24. Ma M, Yoon S, Kim J, Lee Y, Kang S, Yoo CD (2020) VLANet: Video-language alignment network for weakly-supervised video moment retrieval. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII. Springer, Glasgow, UK, pp 156–171
  25. Chen Z, Ma L, Luo W, Tang P, Wong K-YK (2020) Look closer to ground better: weakly-supervised temporal grounding of sentence in video. Computing Research Repository arXiv Preprint, arXiv:2001.09308
  26. Wang Y, Deng J, Zhou W, Li H (2022) Weakly supervised temporal adjacent network for language grounding. IEEE Trans Multimed 24:3276–3286
  27. Song Y, Wang J, Ma L, Yu Z, Yu J (2020) Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. Computing Research Repository arXiv Preprint, arXiv:2003.07048
  28. Zhang Z, Lin Z, Zhao Z, Zhu J, He X (2020) Regularized two-branch proposal networks for weakly-supervised moment retrieval in videos. In: Proceedings of the 28th ACM international conference on multimedia. ACM, Seattle, WA, USA, pp 4098–4106
  29. Nam J, Ahn D, Kang D, Ha SJ, Choi J (2021) Zero-shot natural language video localization. In: Proceedings of the 18th IEEE/CVF international conference on computer vision. IEEE/CVF, Montreal, Canada, pp 1470–1479
  30. Gao J, Xu C (2021) Learning video moment retrieval without a single annotated video. IEEE Trans Circuits Syst Video Technol 32:1646–1657
  31. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing. Association for Computational Linguistics, Doha, Qatar, pp 1532–1543
  32. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: Proceedings of the 38th international conference on machine learning. pp 8748–8763
  33. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the 2017 IEEE conference on computer vision and pattern recognition. IEEE, Honolulu, HI, USA, pp 6299–6308
  34. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30:1–11
  35. Bahdanau D, Cho KH, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd international conference on learning representations. ICLR, San Diego, CA, USA, pp 1–10
  36. Zhou Z-H (2009) Ensemble learning. Encycl Biom 270–273. https://doi.org/10.1007/978-0-387-73003-5_293
  37. Zheng M, Huang Y, Chen Q, Peng Y, Liu Y (2022) Weakly supervised temporal sentence grounding with gaussian-based contrastive proposal learning. In: Proceedings of the 2022 IEEE/CVF conference on computer vision and pattern recognition. pp 15555–15564
  38. Krishna R, Hata K, Ren F, Fei-Fei L, Carlos Niebles J (2017) Dense-captioning events in videos. In: Proceedings of the 16th IEEE international conference on computer vision. IEEE, Venice, Italy, pp 706–715
  39. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. Computing Research Repository arXiv Preprint, arXiv:1412.6980
  40. Wu H, Lyu Y, Shen X, Zhao X, Wang M, Zhang X, Luo Z (2023) Atomic-action-based contrastive network for weakly supervised temporal language grounding. In: 2023 IEEE international conference on multimedia and expo (ICME). pp 1523–1528
  41. Song Y, Wang J, Ma L, Yu J, Liang J, Yuan L, Yu Z (2023) MARN: multi-level attentional reconstruction networks for weakly supervised video temporal grounding. Neurocomputing 554:126625
