Combining self-attention and depth-wise convolution for human pose estimation
References
Li, B., Dai, Y., Cheng, X., Chen, H., Lin, Y., He, M.: Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN, pp. 601–604 (2017)
Li, B., He, M., Dai, Y., Cheng, X., Chen, Y.: 3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN. Multimedia Tools Appl. 77, 22901–22921 (2018)
Luvizon, D.C., Picard, D., Tabia, H.: 2D/3D pose estimation and action recognition using multitask deep learning, pp. 5137–5146 (2018)
Li, B., Chen, H., Chen, Y., Dai, Y., He, M.: Skeleton boxes: solving skeleton based action detection with a single deep convolutional neural network, pp. 613–616 (2017)
Zhou, T., Wang, W., Qi, S., Ling, H., Shen, J.: Cascaded human-object interaction recognition, pp. 4263–4272 (2020)
Wan, B., Zhou, D., Liu, Y., Li, R., He, X.: Pose-aware multi-level feature network for human object interaction detection, pp. 9469–9478 (2019)
Yang, J., Zhang, J., Yu, F., Jiang, X., Zhang, M., Sun, X., Chen, Y.-C., Zheng, W.-S.: Learning to know where to see: a visibility-aware approach for occluded person re-identification, pp. 11885–11894 (2021)
Chen, H., Lagadec, B., Bremond, F.: ICE: inter-instance contrastive encoding for unsupervised person re-identification, pp. 14960–14969 (2021)
Luo, Z., Wang, Z., Huang, Y., Wang, L., Tan, T., Zhou, E.: Rethinking the heatmap regression for bottom-up human pose estimation, pp. 13264–13273 (2021)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation, pp. 5693–5703 (2019)
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
Corbetta, M., Shulman, G.L.: Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 3(3), 201–215 (2002)
Rensink, R.A.: The dynamic representation of scenes. Vis. Cogn. 7(1–3), 17–42 (2000)
Pan, X., Ge, C., Lu, R., Song, S., Chen, G., Huang, Z., Huang, G.: On the integration of self-attention and convolution, pp. 815–825 (2022)
Zhu, L., Wang, X., Ke, Z., Zhang, W., Lau, R.W.: BiFormer: vision transformer with bi-level routing attention, pp. 10323–10333 (2023)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv preprint arXiv:1810.04805
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł, Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Li, W., Chen, H., Guo, J., Zhang, Z., Wang, Y.: Brain-inspired multilayer perceptron with spiking neurons, pp. 783–793 (2022)
Guo, J., Tang, Y., Han, K., Chen, X., Wu, H., Xu, C., Xu, C., Wang, Y.: Hire-MLP: Vision MLP via hierarchical rearrangement, pp. 826–836 (2022)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale (2020). arXiv preprint arXiv:2010.11929
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention, pp. 10347–10357 (2021)
Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: LocalViT: bringing locality to vision transformers (2021). arXiv preprint arXiv:2104.05707
Ramakrishna, V., Munoz, D., Hebert, M., Bagnell, J.A., Sheikh, Y.: Pose machines: articulated pose estimation via inference machines, pp. 33–47 (2014)
Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Papandreou, G., Zhu, T., Chen, L.-C., Gidaris, S., Tompson, J., Murphy, K.: PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model, pp. 269–286 (2018)
Li, K., Wang, S., Zhang, X., Xu, Y., Xu, W., Tu, Z.: Pose recognition with cascade transformers, pp. 1944–1953 (2021)
Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X., Wang, J.: HRFormer: high-resolution transformer for dense prediction (2021). arXiv preprint arXiv:2110.09408
Xu, Y., Zhang, J., Zhang, Q., Tao, D.: ViTPose: simple vision transformer baselines for human pose estimation. Adv. Neural. Inf. Process. Syst. 35, 38571–38584 (2022)
Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: keypoint localization via transformer, pp. 11802–11812 (2021)
Huang, L., Tan, J., Liu, J., Yuan, J.: Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation, pp. 17–33 (2020)
Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers, pp. 1954–1963 (2021)
Li, Y., Zhang, S., Wang, Z., Yang, S., Yang, W., Xia, S.-T., Zhou, E.: TokenPose: learning keypoint tokens for human pose estimation, pp. 11313–11322 (2021)
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context, pp. 740–755 (2014)
Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis, pp. 3686–3693 (2014)
Zhang, F., Zhu, X., Dai, H., Ye, M., Zhu, C.: Distribution-aware coordinate representation for human pose estimation, pp. 7093–7102 (2020)
Stoffl, L., Vidal, M., Mathis, A.: End-to-end trainable multi-instance pose estimation with transformers (2021). arXiv preprint arXiv:2103.12115