The HA4M dataset: Multi-Modal Monitoring of an assembly task for Human Action recognition in Manufacturing
This paper introduces the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset, a collection of multi-modal data on actions performed by different subjects while building an Epicyclic Gear Train (EGT). In particular, 41 subjects executed several trials of the assembly task, which consists of 12 actions. Data were collected in a laboratory setting using a Microsoft® Azure Kinect, which integrates a depth camera, an RGB camera, and InfraRed (IR) emitters. To the best of the authors' knowledge, the HA4M dataset is the first multi-modal dataset of an assembly task containing six types of data: RGB images, Depth maps, IR images, RGB-to-Depth-Aligned images, Point Clouds and Skeleton data (a minimal capture sketch illustrating these modalities is given at the end of this section). These data provide a solid foundation for developing and testing advanced action recognition systems in fields such as Computer Vision and Machine Learning, and in application domains such as smart manufacturing and human-robot collaboration.

Background & Summary

Human action recognition is an active research topic in computer vision 1,2 and machine learning 3,4, and a vast body of work has been produced over the last decade 5. Moreover, the recent spread of low-cost video camera systems, including depth cameras 6, has boosted the development of observation systems in a variety of application domains such as video surveillance, safety and smart home security, ambient assisted living, and health care. However, little work has been done on human action recognition for manufacturing assembly 7-9, and the scarcity of public datasets limits the study, development, and comparison of new methods. This is mainly due to challenging issues such as between-action similarity, the complexity of the actions, the manipulation of tools and parts, and the presence of fine motions and intricate operations. The recognition of human actions in intelligent manufacturing is important for several purposes: improving operational efficiency 8; promoting human-robot cooperation 10; assisting operators 11; supporting employee training 9,12; increasing productivity and safety 13; and promoting workers' mental health 14.

In this paper, we present the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset, a multi-modal dataset acquired with an RGB-D camera during the assembly of an Epicyclic Gear Train (EGT) (see Fig. 1). The HA4M dataset provides a good basis for developing, validating and testing techniques and methodologies for recognizing assembly actions. The literature is rich in RGB-D datasets for human action recognition 15-17, mostly acquired in unconstrained indoor/outdoor settings. They chiefly cover daily actions (such as walking, jumping, waving and bending), medical conditions (such as headache, back pain and staggering), two-person interactions (such as hugging, taking a photo, finger-pointing and handing over an object), or gaming actions (such as forward punching, tennis serving and golf swinging). Table 1 reports some of the most popular RGB-D datasets for human action recognition and summarizes their main characteristics.
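For readers who wish to reproduce a comparable capture pipeline, the following is a minimal sketch using the open-source pyk4a wrapper for the Azure Kinect Sensor SDK. It is not the authors' acquisition code: the configuration values (resolution, depth mode) are illustrative assumptions, and skeleton data would additionally require the separate Azure Kinect Body Tracking SDK, which this wrapper does not expose.

```python
# Minimal sketch: grabbing five of the six HA4M-style modalities from an
# Azure Kinect via the open-source pyk4a wrapper. Configuration values are
# illustrative; skeletons come from the separate Body Tracking SDK.
from pyk4a import ColorResolution, Config, DepthMode, PyK4A

k4a = PyK4A(
    Config(
        color_resolution=ColorResolution.RES_1080P,   # assumed setting
        depth_mode=DepthMode.NFOV_UNBINNED,           # assumed setting
        synchronized_images_only=True,
    )
)
k4a.start()

capture = k4a.get_capture()
rgb = capture.color                      # BGRA color image
depth = capture.depth                    # 16-bit depth map, in millimetres
ir = capture.ir                          # infrared image
rgb_aligned = capture.transformed_color  # RGB re-projected into the depth frame
points = capture.depth_point_cloud       # (H, W, 3) XYZ point cloud, in millimetres

k4a.stop()
```

Each attribute returns a NumPy array, so per-frame data can be saved directly (e.g., as images or .npy files) to mirror the per-modality organization of a multi-modal dataset such as HA4M.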