The modified method for position estimation of human body joints in video sequences

Denys Volodymyrovych Soldatov
https://orcid.org/0000-0002-2194-7717
Anton Yuriiovych Varfolomieiev
https://orcid.org/0000-0002-6990-7140

Abstract

Reliable recognition of human movements has a wide range of applications, including games, human-computer interaction, security, and healthcare. In recent years, computer graphics and computer vision researchers have developed many new motion-capture algorithms that run on simpler hardware and with far fewer limitations than before. The objective of this paper is to improve the accuracy with which the positions of human skeleton joints are estimated in video sequences. The proposed method consists of five blocks. The input image sequence is fed to the tracking and motion compensation unit, where a tracking algorithm determines the displacement of the object and centers the object within the frame. The motion information is also passed to an additional point-of-view estimation unit. This unit calculates the motion angles in the frame and monitors the object size, thereby determining whether the object is approaching the camera or moving away from it, and feeds these data to the neural network. The network consists of three convolutional layers, each followed by a pooling layer. The last pooling layer connects to a cascade of three fully connected layers. All layers use ReLU activations except the last one, which uses a linear activation. HOG3D features serve as the input to the first convolutional layer. The data from the point-of-view estimation unit and from the tracking and motion compensation unit go directly to the input of the fully connected layers. To cope with inaccurate or undetected joint positions, the method uses an additional procedure that identifies unreliable joints and extrapolates their new positions from the previous ones using an additional neural network.
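The kind of side information that the tracking and point-of-view units pass on can be illustrated with a minimal sketch. It assumes the tracker reports one bounding box per frame; the `Box` layout, the `view_cues` and `centre_offset` names, and the 5% tolerance `tol` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Tracker output for one frame (hypothetical layout): centre and size in pixels."""
    cx: float
    cy: float
    w: float
    h: float


class ViewCues(NamedTuple):
    dx: float          # horizontal displacement of the box centre, pixels
    dy: float          # vertical displacement of the box centre, pixels
    angle_deg: float   # direction of motion in the image plane, degrees
    scale: float       # box area ratio current/previous; > 1 means the box grew
    trend: str         # "approaching", "receding" or "steady"


def view_cues(prev: Box, curr: Box, tol: float = 0.05) -> ViewCues:
    """Derive motion angle and zoom cues from two consecutive tracker boxes."""
    dx = curr.cx - prev.cx
    dy = curr.cy - prev.cy
    # Image y grows downwards, so a positive angle points towards the bottom of the frame.
    angle_deg = math.degrees(math.atan2(dy, dx))
    scale = (curr.w * curr.h) / (prev.w * prev.h)
    if scale > 1.0 + tol:
        trend = "approaching"   # the object looks larger, so it is assumed to move closer
    elif scale < 1.0 - tol:
        trend = "receding"
    else:
        trend = "steady"
    return ViewCues(dx, dy, angle_deg, scale, trend)


def centre_offset(box: Box, frame_w: int, frame_h: int) -> Tuple[float, float]:
    """Shift in pixels that moves the box centre to the frame centre (motion compensation)."""
    return frame_w / 2 - box.cx, frame_h / 2 - box.cy


if __name__ == "__main__":
    previous = Box(cx=100.0, cy=100.0, w=40.0, h=80.0)
    current = Box(cx=110.0, cy=110.0, w=44.0, h=88.0)
    print(view_cues(previous, current))
    print(centre_offset(current, 640, 480))
```

Centering alone discards the displacement and the change in apparent size. Values such as `angle_deg` and `scale` are one possible way to preserve that information as extra inputs to the fully connected layers.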
This structure is expected to improve the accuracy of position prediction for the following reasons: taking the motion angles and zooming into account makes it possible to distinguish movements that look similar in centered frames but differ in displacement; an adaptive window size is used for the HOG3D features; and a neural network extrapolates the joint positions when a prediction is missing or has low accuracy. Experiments on the HumanEva-1 dataset confirmed that the suggested modifications achieve higher accuracy, which indicates that the proposed modified method is promising for predicting body position in motion recognition systems.
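The fallback for missing or unreliable joints can be sketched as follows. This is not the authors' procedure: the paper extrapolates with an additional neural network, whereas the sketch substitutes a plain constant-velocity extrapolation from the two previous frames. The 2D point format, the function names, and the `max_jump` threshold of 30 pixels used to flag a joint as unreliable are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def extrapolate(prev2: Point, prev1: Point) -> Point:
    """Constant-velocity guess: repeat the last observed step (stand-in for the network)."""
    return (2 * prev1[0] - prev2[0], 2 * prev1[1] - prev2[1])


def repair_joints(
    prev2: Sequence[Point],
    prev1: Sequence[Point],
    predicted: Sequence[Optional[Point]],
    max_jump: float = 30.0,  # hypothetical reliability threshold, pixels
) -> Tuple[List[Point], List[bool]]:
    """Replace missing or implausible joint predictions with extrapolated positions.

    A joint counts as unreliable here when the prediction is None or lies
    farther than max_jump from the extrapolated position.
    """
    repaired: List[Point] = []
    replaced: List[bool] = []
    for p2, p1, pred in zip(prev2, prev1, predicted):
        guess = extrapolate(p2, p1)
        unreliable = (
            pred is None
            or math.hypot(pred[0] - guess[0], pred[1] - guess[1]) > max_jump
        )
        repaired.append(guess if unreliable else pred)
        replaced.append(unreliable)
    return repaired, replaced


if __name__ == "__main__":
    frame_t2 = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
    frame_t1 = [(1.0, 2.0), (12.0, 10.0), (5.0, 6.0)]
    current = [(2.0, 4.5), None, (100.0, 100.0)]
    print(repair_joints(frame_t2, frame_t1, current))
```

In the example, the first joint keeps its prediction, the second is missing and is filled in, and the third jumps implausibly far and is replaced.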

Article Details

How to Cite
D. V. Soldatov and A. Y. Varfolomieiev, “The modified method for position estimation of human body joints in video sequences”, Мікросист., Електрон. та Акуст., vol. 24, no. 6, pp. 53–59, Dec. 2019.
Section
Electronic Systems and Signals
