Su Z, Liu T, Hao X, Hu X (2023) Spatial-temporal graph convolutional networks for traffic flow prediction considering multiple traffic parameters. J Supercomput 79(16):18293–18312
Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-K, Woo W-C (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems 28
Requena-Mesa C, Benson V, Reichstein M, Runge J, Denzler J (2021) Earthnet2021: a large-scale dataset and challenge for earth surface forecasting as a guided video prediction task. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1132–1142
Kocabas M, Athanasiou N, Black MJ (2020) Vibe: video inference for human body pose and shape estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5253–5263
Liao X, Yuan J, Cai Z, Lai Jh (2023) An attention-based bidirectional GRU network for temporal action proposals generation. J Supercomput 79(8):8322–8339
Sun P, Kretzschmar H, Dotiwalla X, Chouard A, Patnaik V, Tsui P, Guo J, Zhou Y, Chai Y, Caine B, et al (2020) Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2446–2454
Kumar VR, Yogamani S, Rashed H, Sitsu G, Witt C, Leang I, Milz S, Mäder P (2021) Omnidet: surround view cameras based multi-task visual perception network for autonomous driving. IEEE Robot Autom Lett 6(2):2830–2837
Wu H, Yao Z, Wang J, Long M (2021) Motionrnn: a flexible model for video prediction with spacetime-varying motions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 15435–15444
Wang Y, Gao Z, Long M, Wang J, Yu PS (2018) Predrnn++: towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In: International Conference on Machine Learning, pp 5123–5132
Wang Y, Wu H, Zhang J, Gao Z, Wang J, Yu PS, Long M (2022) Predrnn: a recurrent neural network for spatiotemporal predictive learning. IEEE Trans Pattern Anal Mach Intell 45(2):2208–2225
Lin Z, Li M, Zheng Z, Cheng Y, Yuan C (2020) Self-attention ConvLSTM for spatiotemporal prediction. Proc AAAI Conf Artif Intell 34:11531–11538
Wang Y, Jiang L, Yang M-H, Li L-J, Long M, Fei-Fei L (2019) Eidetic 3D LSTM: a model for video prediction and beyond. In: International Conference on Learning Representations
Chang Z, Zhang X, Wang S, Ma S, Ye Y, Xinguang X, Gao W (2021) Mau: a motion-aware unit for video prediction and beyond. Adv Neural Inf Process Syst 34:26950–26962
Chang Z, Zhang X, Wang S, Ma S, Gao W (2022) STAM: a spatiotemporal attention based memory for video prediction. IEEE Trans Multimed 25:2354
Lee S, Kim HG, Choi DH, Kim H-I, Ro YM (2021) Video prediction recalling long-term motion context via memory alignment learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3054–3063
Wang Y, Long M, Wang J, Gao Z, Yu PS (2017) PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMS. In: Advances in Neural Information Processing Systems 30
Ye X, Bilodeau G-A (2022) VPTR: efficient transformers for video prediction. In: 2022 26th International Conference on Pattern Recognition (ICPR), pp 3492–3499. IEEE
Gao Z, Tan C, Wu L, Li SZ (2022) SimVP: simpler yet better video prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3170–3180
Wang Y, Zhang J, Zhu H, Long M, Wang J, Yu PS (2019) Memory in memory: a predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9154–9162
Gao H, Xu H, Cai Q-Z, Wang R, Yu F, Darrell T (2019) Disentangling propagation and generation for video prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 9006–9015
Chang Z, Zhang X, Wang S, Ma S, Gao W (2022) STRPM: a spatiotemporal residual predictive model for high-resolution video prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13946–13955
Sun M, Wang W, Zhu X, Liu J (2023) Moso: decomposing motion, scene and object for video prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18727–18737
Yu W, Lu Y, Easterbrook S, Fidler S (2019) Crevnet: conditionally reversible video prediction. arXiv preprint arXiv:1910.11577
Gupta A, Tian S, Zhang Y, Wu J, Martín-Martín R, Fei-Fei L (2022) Maskvit: masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894
Akan AK, Erdem E, Erdem A, Güney F (2021) SLAMP: stochastic latent appearance and motion prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 14728–14737
Denton E, Fergus R (2018) Stochastic video generation with a learned prior. In: International Conference on Machine Learning, pp 1174–1183. PMLR
Babaeizadeh M, Finn C, Erhan D, Campbell RH, Levine S (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252
Zhang Z, Hu J, Cheng W, Paudel D, Yang J (2024) Extdm: distribution extrapolation diffusion model for video prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 19310–19320
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, pp 234–241. Springer
Zhou Z, Rahman Siddiquee MM, Tajbakhsh N, Liang J (2018) Unet++: a nested u-net architecture for medical image segmentation. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, pp 3–11. Springer
Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, Han X, Chen Y-W, Wu J (2020) Unet 3+: a full-scale connected Unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 1055–1059. IEEE
Alom MZ, Yakopcic C, Hasan M, Taha TM, Asari VK (2019) Recurrent residual u-net for medical image segmentation. J Med Imaging 6(1):014006–014006
Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B et al (2018) Attention u-net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Bengio S, Vinyals O, Jaitly N, Shazeer N (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems 28
Zhang J, Zheng Y, Qi D (2017) Deep spatio-temporal residual networks for citywide crowd flows prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 31
Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Unterthiner T, Van Steenkiste S, Kurach K, Marinier R, Michalski M, Gelly S (2018) Towards accurate generative models of video: a new metric & challenges. arXiv preprint arXiv:1812.01717
Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using LSTMS. In: International Conference on Machine Learning, pp 843–852
Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004, vol 3, pp 32–36. IEEE
Ionescu C, Papava D, Olaru V, Sminchisescu C (2013) Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell 36(7):1325–1339
Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: the kitti dataset. Int J Robot Res 32(11):1231–1237
Villegas R, Yang J, Hong S, Lin X, Lee H (2017) Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033
Tang S, Li C, Zhang P, Tang R (2023) Swinlstm: improving spatiotemporal prediction accuracy using swin transformer and LSTM. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 13470–13479
Tan C, Gao Z, Wu L, Xu Y, Xia J, Li S, Li SZ (2023) Temporal attention unit: towards efficient spatiotemporal predictive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18770–18782
Yao Z, Wang Y, Wu H, Wang J, Long M (2023) Modernn: harnessing spatiotemporal mode collapse in unsupervised predictive learning. IEEE Trans Pattern Anal Mach Intell 45(11):13281–13296
Tang Y, Dong P, Tang Z, Chu X, Liang J (2024) VMRNN: integrating vision mamba and LSTM for efficient and accurate spatiotemporal forecasting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 5663–5673
Rusch TK, Chamberlain BP, Mahoney MW, Bronstein MM, Mishra S (2022) Gradient gating for deep multi-rate learning on graphs. arXiv preprint arXiv:2210.00513