Jinhui Yi | Rheinische Friedrich-Wilhelms-Universität Bonn
Papers by Jinhui Yi
As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE, and autoregressive Transformer, is proposed to synthesize a video given an initial scene image and a non-fixed number of semantic scene graphs. We evaluate SSGVS and other state-of-the-art video synthesis models on the Action Genome dataset and demonstrate the positive significance of video scene graphs in video synthesis. The source code is available at https://github.com/yrcong/SSGVS.
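The key input format here, a video scene graph, can be pictured as per-frame (subject, predicate, object) triples, with annotations present only at some frames. A minimal sketch, using purely illustrative labels rather than the Action Genome schema, shows why the VSG encoder must predict representations for the unlabeled frames in between:

```python
# Hedged sketch: a video scene graph as per-frame (subject, predicate, object)
# triples. All object/predicate names are hypothetical, not the dataset schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class SceneGraphTriple:
    subject: str
    predicate: str
    obj: str

# Annotations are temporally discrete: frames 1 and 3 have no scene graph,
# which is exactly the gap the VSG encoder is trained to fill.
video_scene_graph = {
    0: [SceneGraphTriple("person", "holding", "cup")],
    2: [SceneGraphTriple("person", "drinking_from", "cup")],
    4: [SceneGraphTriple("person", "putting_down", "cup")],
}

num_frames = 5
annotated = sorted(video_scene_graph)
unlabeled = [t for t in range(num_frames) if t not in video_scene_graph]
print(annotated)   # [0, 2, 4]
print(unlabeled)   # [1, 3]
```

Unlike a single class label or caption, this structure pins actions to frame ranges, which is the explicit temporal guidance the abstract refers to.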
IEEE Robotics and Automation Letters
With the advances in capturing 2D or 3D skeleton data, skeleton-based action recognition has received increasing interest over the last years. As skeleton data is commonly represented by graphs, graph convolutional networks have been proposed for this task. While current graph convolutional networks accurately recognize actions, they are too expensive for robotics applications where limited computational resources are available. In this paper, we therefore propose a highly efficient graph convolutional network that addresses the limitations of previous works. This is achieved by a parallel structure that gradually fuses motion and spatial information and by reducing the temporal resolution as early as possible. Furthermore, we explicitly address the issue that human poses can contain errors. To this end, the network first refines the poses before they are further processed to recognize the action. We therefore call the network Pose Refinement Graph Convolutional Network. Compared to other graph convolutional networks, our network requires 86%-93% fewer parameters and reduces the floating point operations by 89%-96% while achieving a comparable accuracy. It therefore provides a much better trade-off between accuracy, memory footprint, and processing time, which makes it suitable for robotics applications.
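To make the graph-convolution idea behind such networks concrete, the generic spatial update X' = D^(-1/2)(A + I)D^(-1/2) X W can be sketched in plain Python on a toy 4-joint chain. This is a minimal sketch of the standard GCN layer under assumed toy dimensions, not the Pose Refinement network itself:

```python
import math

# One spatial graph-convolution step on a toy skeleton: each joint's feature
# vector is averaged with its neighbors' (degree-normalized), then projected.
num_joints, in_dim, out_dim = 4, 3, 2
edges = [(0, 1), (1, 2), (2, 3)]  # e.g. hip -> knee -> ankle -> toe

# Adjacency with self-loops (A + I).
A = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0

# Symmetric normalization: A_norm[i][j] = A[i][j] / sqrt(d_i * d_j).
deg = [sum(row) for row in A]
A_norm = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
          for i in range(num_joints)]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

X = [[0.1 * (i + j) for j in range(in_dim)] for i in range(num_joints)]  # joint features
W = [[0.5] * out_dim for _ in range(in_dim)]                             # toy weights

X_out = matmul(matmul(A_norm, X), W)  # aggregate neighbors, then project
print(len(X_out), len(X_out[0]))      # 4 2
```

Stacking such layers over the skeleton graph (plus temporal processing) is what makes these models accurate but costly, and the paper's efficiency gains come from restructuring this stack rather than from changing the basic operation.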
Trajectory forecasting is a crucial step for autonomous vehicles and mobile robots in order to navigate and interact safely. To handle the spatial interactions between objects, graph-based approaches have been proposed. These methods, however, model motion on a frame-to-frame basis and do not provide a strong temporal model. To overcome this limitation, we propose a compact model called Spatial-Temporal Consistency Network (STC-Net). In STC-Net, dilated temporal convolutions are introduced to model long-range dependencies along each trajectory for better temporal modeling, while graph convolutions are employed to model the spatial interaction among different trajectories. Furthermore, we propose a feature-wise convolution to generate the predicted trajectories in one pass and refine the forecast trajectories together with the reconstructed observed trajectories. We demonstrate that STC-Net generates spatially and temporally consistent trajectories and outperforms other graph-...
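The appeal of dilated temporal convolutions, mentioned above, is that stacking kernel-size-3 layers with dilations 1, 2, 4, ... grows the temporal receptive field exponentially while parameters grow only linearly. A small sketch of that arithmetic (generic dilated-convolution math, not STC-Net's actual layer configuration):

```python
# Receptive field of stacked dilated 1D convolutions:
# each layer with kernel size k and dilation d adds (k - 1) * d frames.
def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

print(receptive_field(3, [1]))        # 3  (one plain convolution)
print(receptive_field(3, [1, 2, 4]))  # 15 frames of context from only 3 layers
```

This is why a few dilated layers can capture long-range dependencies along a trajectory that frame-to-frame models miss.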
Sensors
In order to enable timely actions to prevent major losses of crops caused by lack of nutrients and, hence, increase the potential yield throughout the growing season while at the same time preventing excess fertilization with detrimental environmental consequences, early, non-invasive, and on-site detection of nutrient deficiency is required. Current non-invasive methods for assessing the nutrient status of crops deal in most cases with nitrogen (N) deficiency only, and optical sensors to diagnose N deficiency, such as chlorophyll meters or canopy reflectance sensors, do not monitor N, but instead measure changes in leaf spectral properties that may or may not be caused by N deficiency. In this work, we study how well nutrient deficiency symptoms can be recognized in RGB images of sugar beets. To this end, we collected the Deep Nutrient Deficiency for Sugar Beet (DND-SB) dataset, which contains 5648 images of sugar beets growing on a long-term fertilizer experiment with nutrient deficie...