Yi Zhu | University of California, Merced

Papers by Yi Zhu

Guided Optical Flow Learning

CVPRW, 2017

We study the unsupervised learning of CNNs for optical flow estimation using proxy ground truth data. Supervised CNNs, due to their immense learning capacity, have shown superior performance on a range of computer vision problems including optical flow prediction. However, they require ground-truth flow, which is usually not accessible except for limited synthetic data. Without the guidance of ground truth optical flow, unsupervised CNNs often perform worse as they are naturally ill-conditioned. We therefore propose a novel framework in which proxy ground truth data generated from classical approaches is used to guide the CNN learning. The models are further refined in an unsupervised fashion using an image reconstruction loss. Our guided learning approach is competitive with or superior to state-of-the-art approaches on three standard benchmark datasets, yet it is completely unsupervised and can run in real time.
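To make the guided objective concrete, here is a minimal PyTorch sketch of how a proxy-flow guidance term and a photometric reconstruction term might be combined. The function names, loss weighting, and warping details are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (N,C,H,W) by flow (N,2,H,W) using grid_sample."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)   # (2,H,W) pixel coords
    coords = grid.unsqueeze(0) + flow                            # displaced coordinates
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)        # (N,H,W,2)
    return F.grid_sample(img, grid_norm, align_corners=True)

def guided_loss(pred_flow, proxy_flow, frame1, frame2, lam=1.0):
    # Guidance term: endpoint error against proxy flow from a classical method.
    epe = torch.norm(pred_flow - proxy_flow, p=2, dim=1).mean()
    # Unsupervised refinement term: photometric reconstruction error.
    recon = F.l1_loss(warp(frame2, pred_flow), frame1)
    return epe + lam * recon
```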

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

WACV, 2017

This paper studies the joint learning of action recognition and temporal localization in long, untrimmed videos. We employ a multi-task learning framework that performs the three highly related steps of action proposal, action recognition, and action localization refinement in parallel instead of the standard sequential pipeline that performs the steps in order. We develop a novel temporal actionness regression module that estimates what proportion of a clip contains action. We use it for temporal localization, but it could have other applications such as video retrieval, surveillance, and summarization. We also introduce random shear augmentation during training to simulate viewpoint change. We evaluate our framework on three popular video benchmarks. Results demonstrate that our joint model is efficient in terms of storage and computation in that we do not need to compute and cache dense trajectory features, and that it is several times faster than its sequential ConvNets counterpart. Yet, despite being more efficient, it outperforms state-of-the-art methods with respect to accuracy.
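The actionness regression idea can be sketched as a small head that maps a clip feature to a value in [0, 1]; the feature dimension, layer sizes, and MSE objective below are assumptions for illustration rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class ActionnessHead(nn.Module):
    """Regresses what proportion of a clip contains action (a value in [0, 1])."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, clip_feat):         # clip_feat: (N, feat_dim)
        return self.fc(clip_feat).squeeze(1)

# Training step sketch: target is the labeled action proportion per clip.
head = ActionnessHead()
feats = torch.randn(8, 4096)              # placeholder clip features
target = torch.rand(8)                    # ground-truth action proportions
loss = nn.functional.mse_loss(head(feats), target)
loss.backward()
```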

Learning Optical Flow via Dilated Networks and Occlusion Reasoning

ICIP, 2018

Spatial Morphing Kernel Regression For Feature Interpolation

ICIP, 2018

In recent years, geotagged social media has become popular as a novel source for geographic knowledge discovery. Ground-level images and videos provide a different perspective than overhead imagery and can be applied to a range of applications such as land use mapping, activity detection, pollution mapping, etc. The sparse and uneven distribution of this data presents a problem, however, for generating dense maps. We therefore investigate the problem of spatially interpolating the high-dimensional features extracted from sparse social media to enable dense labeling using standard classifiers. Further, we show how prior knowledge about region boundaries can be used to improve the interpolation through spatial morphing kernel regression. We show that an interpolate-then-classify framework can produce dense maps from sparse observations but that care must be taken in choosing the interpolation method. We also show that the spatial morphing kernel improves the results.
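A hedged sketch of the interpolate-then-classify idea: Nadaraya-Watson kernel regression over sample locations, with the spatial kernel attenuated across region boundaries to mimic the morphing prior. The Gaussian kernel, the penalty factor alpha, and the region_of helper are illustrative assumptions.

```python
import numpy as np

def kernel_interpolate(query_xy, sample_xy, sample_feats, region_of,
                       sigma=0.01, alpha=0.1):
    """Nadaraya-Watson interpolation of high-dimensional features at query_xy.

    The Gaussian spatial kernel is down-weighted (morphed) when a sample lies
    in a different region than the query, encoding prior boundary knowledge.
    """
    d2 = np.sum((sample_xy - query_xy) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    same = region_of(sample_xy) == region_of(query_xy[None, :])
    k = np.where(same, k, alpha * k)          # penalize cross-boundary influence
    k = k / (k.sum() + 1e-12)
    return k @ sample_feats                    # (feat_dim,) weighted average

# Toy usage: two regions split at x = 0.5.
region_of = lambda xy: (xy[:, 0] > 0.5).astype(int)
xy = np.random.rand(100, 2)
feats = np.random.rand(100, 64)
f_hat = kernel_interpolate(np.array([0.3, 0.7]), xy, feats, region_of)
```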

DenseNet for Dense Flow

ICIP, 2017

Classical approaches for estimating optical flow have achieved rapid progress in the last decade. However, most of them are too slow to be applied in real-time video analysis. Due to the great success of deep learning, recent work has focused on using CNNs to solve such dense prediction problems. In this paper, we investigate a new deep architecture, Densely Connected Convolutional Networks (DenseNet), to learn optical flow. This specific architecture is ideal for the problem at hand as it provides shortcut connections throughout the network, which leads to implicit deep supervision. We extend the DenseNet architecture to a fully convolutional network to learn motion estimation in an unsupervised manner. Evaluation results on three standard benchmarks demonstrate that DenseNet is a better fit than other widely adopted CNN architectures for optical flow estimation.

Deep Local Video Feature for Action Recognition

CVPR, 2017

We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, due to GPU memory constraints, it is not feasible to feed a whole video into CNNs/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video labels as supervision. One major problem of this popular approach is that the local samples may not contain the information indicated by the global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features into global labels. We study a set of problems regarding this new type of local features, such as how to aggregate them into global features. Experimental results on the HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling on the sparsely sampled features leads to significant performance improvement.
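A minimal sketch of the local-to-global pipeline described above: element-wise max pooling aggregates per-clip features, and a second classifier maps the pooled feature to the video label. The random features and the scikit-learn logistic regression are stand-ins for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aggregate_max(local_feats):
    """Aggregate sparsely sampled per-clip CNN features (T, D) into one
    global video feature by element-wise maximum pooling."""
    return local_feats.max(axis=0)

# Sketch: local features per video -> global features -> global labels.
videos = [np.random.rand(np.random.randint(5, 25), 512) for _ in range(40)]
labels = np.random.randint(0, 3, size=40)
X = np.stack([aggregate_max(v) for v in videos])
clf = LogisticRegression(max_iter=1000).fit(X, labels)   # second mapping function
```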

Fine-Grained Land Use Classification at the City Scale Using Ground-Level Images

IEEE Transactions on Multimedia, 2019

We perform fine-grained land use mapping at the city scale using ground-level images. Mapping land use is considerably more difficult than mapping land cover and is generally not possible using overhead imagery, as it requires close-up views and seeing inside buildings. We postulate that the growing collections of georeferenced, ground-level images suggest an alternate approach to this geographic knowledge discovery problem. We develop a general framework that uses Flickr images to map 45 different land-use classes for the City of San Francisco. Individual images are classified using a novel convolutional neural network containing two streams, one for recognizing objects and another for recognizing scenes. This network is trained in an end-to-end manner directly on the labeled training images. We propose several strategies to overcome the noisiness of our user-generated data, including search-based training set augmentation and online adaptive training. We derive a ground truth map of San Francisco in order to evaluate our method. We demonstrate the effectiveness of our approach through geo-visualization and quantitative analysis. Our framework achieves over 29% recall at the individual land parcel level, which represents a strong baseline for the challenging 45-way land use classification problem, especially given the noisiness of the image data.
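A rough sketch of the two-stream image classifier, assuming PyTorch/torchvision; the paper's actual object and scene backbones (and their pre-training) differ, so ResNet-18 here is purely a placeholder.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamLandUseNet(nn.Module):
    """Two-stream classifier: one backbone for objects, one for scenes,
    fused before a 45-way land-use prediction."""
    def __init__(self, num_classes=45):
        super().__init__()
        self.objects = models.resnet18(weights=None)   # object stream (placeholder)
        self.scenes = models.resnet18(weights=None)    # scene stream (placeholder)
        feat = self.objects.fc.in_features
        self.objects.fc = nn.Identity()
        self.scenes.fc = nn.Identity()
        self.classifier = nn.Linear(2 * feat, num_classes)

    def forward(self, img):
        f = torch.cat([self.objects(img), self.scenes(img)], dim=1)
        return self.classifier(f)

logits = TwoStreamLandUseNet()(torch.randn(2, 3, 224, 224))   # (2, 45)
```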

What Is It Like Down There? Generating Dense Ground-Level Views and Image Features From Overhead Imagery Using Conditional Generative Adversarial Networks

ACM SIGSPATIAL, 2018

This paper investigates conditional generative adversarial networks (cGANs) to overcome a fundamental limitation of using geotagged media for geographic discovery, namely its sparse and uneven spatial distribution. We train a cGAN to generate ground-level views of a location given overhead imagery. We show the "fake" ground-level images are natural looking and are structurally similar to the real images. More significantly, we show the generated images are representative of the locations and that the representations learned by the cGANs are informative. In particular, we show that dense feature maps generated using our framework are more effective for land-cover classification than approaches which spatially interpolate features extracted from sparse ground-level images. To our knowledge, ours is the first work to use cGANs to generate ground-level views given overhead imagery and to explore the benefits of the learned representations.

Large-Scale Mapping of Human Activity using Geo-Tagged Videos

ACM SIGSPATIAL, 2017

This paper is the first work to perform spatio-temporal mapping of human activity using the visual content of geo-tagged videos. We utilize a recent deep-learning based video analysis framework, termed hidden two-stream networks, to recognize a range of activities in YouTube videos. This framework is efficient and can run in real time or faster, which is important for recognizing events as they occur in streaming video or for reducing latency in analyzing already captured video. This is, in turn, important for using video in smart-city applications. We perform a series of experiments to show our approach is able to accurately map activities both spatially and temporally. We also demonstrate the advantages of using the visual content over the tags/titles.

Spatio-Temporal Sentiment Hotspot Detection using Geotagged Photos

ACM SIGSPATIAL, 2016

We perform spatio-temporal analysis of public sentiment using geotagged photo collections. We develop a deep learning-based classifier that predicts the emotion conveyed by an image. This allows us to associate sentiment with place. We perform spatial hotspot detection and show that different emotions have distinct spatial distributions that match expectations. We also perform temporal analysis using the capture time of the photos. Our spatio-temporal hotspot detection correctly identifies emerging concentrations of specific emotions, and year-by-year analyses of select locations show strong temporal correlations between the predicted emotions and known events.

Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition

ECCV, 2016

This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves. We develop a new framework called depth2action and experiment thoroughly with how best to incorporate the depth information. We introduce spatio-temporal depth normalization (STDN) to enforce temporal consistency in our estimated depth sequences. We also propose modified depth motion maps (MDMM) to capture the subtle temporal changes in depth. These two components significantly improve the action recognition performance. We evaluate our depth2action framework on three large-scale action recognition video benchmarks. Our model achieves state-of-the-art performance when combined with appearance and motion information, thus demonstrating that depth2action is indeed complementary to existing approaches.
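The two components might look roughly like the following NumPy sketch: a joint rescaling of the depth clip (standing in for STDN) and an accumulation of thresholded frame-to-frame depth differences (standing in for MDMM). The threshold value and exact normalization are assumptions; the paper's formulations may differ.

```python
import numpy as np

def normalize_depth_sequence(depth_seq):
    """Spatio-temporal normalization sketch: rescale the whole clip (T, H, W)
    jointly so per-frame estimation offsets do not masquerade as motion."""
    lo, hi = depth_seq.min(), depth_seq.max()
    return (depth_seq - lo) / (hi - lo + 1e-12)

def modified_depth_motion_map(depth_seq, thresh=0.05):
    """Accumulate per-pixel depth changes over a clip into a single motion
    map, keeping only changes above a noise threshold."""
    diffs = np.abs(np.diff(depth_seq, axis=0))   # (T-1, H, W) frame-to-frame change
    diffs[diffs < thresh] = 0.0                  # suppress estimation noise
    return diffs.sum(axis=0)                     # (H, W) accumulated motion

mdmm = modified_depth_motion_map(normalize_depth_sequence(np.random.rand(16, 64, 64)))
```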

Hidden Two-Stream Convolutional Networks for Action Recognition

ACCV, 2018

Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets (UCF101, HMDB51, THUMOS14, and ActivityNet v1.2) show that our approach significantly outperforms the previous best real-time approaches.
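A minimal PyTorch sketch of the stacked design: a small MotionNet maps stacked raw frames to flow-like motion maps, which a temporal classifier consumes, so the whole model is end-to-end trainable without explicit optical flow. Layer sizes and frame counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HiddenTwoStream(nn.Module):
    """Sketch of the hidden two-stream idea: MotionNet produces flow-like
    motion maps from raw frames; a temporal CNN classifies them directly."""
    def __init__(self, n_frames=11, num_classes=101):
        super().__init__()
        self.motion_net = nn.Sequential(            # raw frames -> motion maps
            nn.Conv2d(3 * n_frames, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2 * (n_frames - 1), 3, padding=1),
        )
        self.temporal_cnn = nn.Sequential(           # motion maps -> class scores
            nn.Conv2d(2 * (n_frames - 1), 64, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes),
        )

    def forward(self, frames):                       # (N, 3*n_frames, H, W)
        return self.temporal_cnn(self.motion_net(frames))

scores = HiddenTwoStream()(torch.randn(2, 33, 112, 112))   # (2, 101)
```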

Gated Transfer Network for Transfer Learning

ACCV, 2018

Deep neural networks have led to a series of breakthroughs in computer vision given sufficient annotated training datasets. For novel tasks with limited labeled data, the prevalent approach is to transfer the knowledge learned in pre-trained models to the new tasks by fine-tuning. Classic model fine-tuning utilizes the fact that well-trained neural networks appear to learn cross-domain features. These features are treated equally during transfer learning. In this paper, we explore the impact of feature selection in model fine-tuning by introducing a transfer module, which assigns weights to features extracted from pre-trained models. The proposed transfer module demonstrates the importance of feature selection for transferring models from source to target domains. It is shown to significantly improve upon fine-tuning results with only marginal extra computational cost. We also incorporate an auxiliary classifier as an extra regularizer to avoid over-fitting. Finally, we build a Gated Transfer Network (GTN) based on our transfer module and achieve state-of-the-art results on six different tasks.
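The transfer module can be sketched as a learned sigmoid gate that re-weights pre-trained features channel-wise before classification; the single-layer gate below is an illustrative assumption, not the exact GTN design.

```python
import torch
import torch.nn as nn

class TransferGate(nn.Module):
    """Learns per-channel weights for features from a pre-trained backbone,
    so that transferable channels dominate during fine-tuning."""
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, feats):             # feats: (N, feat_dim)
        return feats * self.gate(feats)   # element-wise re-weighting

gated = TransferGate(2048)(torch.randn(4, 2048))   # (4, 2048)
```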

Random Temporal Skipping for Multirate Video Analysis

ACCV, 2018

Current state-of-the-art approaches to video understanding adopt temporal jittering to simulate analyzing the video at varying frame rates. However, this does not work well for multirate videos, in which actions or subactions occur at different speeds. The frame sampling rate should vary in accordance with the different motion speeds. In this work, we propose a simple yet effective strategy, termed random temporal skipping, to address this situation. This strategy effectively handles multirate videos by randomizing the sampling rate during training. It is an exhaustive approach, which can potentially cover all motion speed variations. Furthermore, due to the large temporal skipping, our network can see video clips that originally cover over 100 frames. Such a time range is enough to analyze most actions/events. We also introduce an occlusion-aware optical flow learning method that generates improved motion maps for human action recognition. Our framework is end-to-end trainable, runs in real time, and achieves state-of-the-art performance on six widely adopted video benchmarks.
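Random temporal skipping itself is easy to sketch: draw a random inter-frame skip per training clip so a fixed-length clip can span a wide range of the original video. The parameter names and fallback behavior below are assumptions.

```python
import numpy as np

def random_temporal_skip(num_frames, clip_len=10, max_skip=10, rng=np.random):
    """Sample clip_len frame indices with a random inter-frame skip, so the
    same clip length can span anywhere from clip_len up to roughly
    clip_len * max_skip frames of the original video."""
    assert num_frames >= clip_len
    skip = rng.randint(1, max_skip + 1)
    span = (clip_len - 1) * skip + 1
    if span > num_frames:                  # fall back to the densest rate
        skip, span = 1, clip_len
    start = rng.randint(0, num_frames - span + 1)
    return np.arange(start, start + span, skip)

print(random_temporal_skip(300))   # e.g. 10 indices covering up to ~100 frames
```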

Towards Universal Representation for Unseen Action Recognition

CVPR, 2018

Unseen Action Recognition (UAR) aims to recognise novel action categories without training examples. While previous methods focus on inner-dataset seen/unseen splits, this paper proposes a pipeline using a large-scale training source to achieve a Universal Representation (UR) that can generalise to a more realistic Cross-Dataset UAR (CD-UAR) scenario. We first address UAR as a Generalised Multiple-Instance Learning (GMIL) problem and discover 'building-blocks' from the large-scale ActivityNet dataset using distribution kernels. Essential visual and semantic components are preserved in a shared space to achieve the UR that can efficiently generalise to new datasets. Predicted UR exemplars can be improved by a simple semantic adaptation, and then an unseen action can be directly recognised using the UR at test time. Without further training, extensive experiments demonstrate significant improvements over the UCF101 and HMDB51 benchmarks.

Real-time Video Action Recognition via Hidden Two-Stream Networks

In this work, we implement a real-time human action recognition framework, termed hidden two-stream networks [1]. This method only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Thus it avoids the computationally expensive pre-computation step of standard two-stream approaches. Here, we first reproduce its results on the UCF101 dataset and then extend it to another large-scale dataset named ActivityNet. We extend [1] by pre-training both the MotionNet and the two-stream CNNs on Kinetics, a recently released large-scale action recognition dataset. We show that a decent initialization can boost the recognition accuracy. In addition, due to the large network footprint of VGG16, we explore other recently proposed deep network architectures such as DenseNet169. Finally, we achieve promising results on both datasets and run much faster than the real-time requirement of 25 frames per second.

Land Use Classification using Convolutional Neural Networks Applied to Ground-Level Images

ACM SIGSPATIAL, 2015

Land use mapping is a fundamental yet challenging task in geographic science. In contrast to land cover mapping, it is generally not possible using overhead imagery. The recent, explosive growth of online geo-referenced photo collections suggests an alternate approach to geographic knowledge discovery. In this work, we present a general framework that uses ground-level images from Flickr for land use mapping.

DoA Estimation and Capacity Analysis for 2D Active Massive MIMO Systems

Mobile data traffic is expected to grow exponentially in the future. In order to meet this challenge as well as the form factor limitation on the base station, two-dimensional (2D) "massive MIMO" has been proposed as one of the enabling technologies for future wireless systems. In 2D "massive MIMO" systems, a base station will rely on uplink sounding signals to infer the downlink spatial channel information to perform MIMO precoding. Accordingly, direction-of-arrival (DoA) estimation of the underlying three-dimensional (3D) channel at the base station becomes essential for 2D "massive MIMO" systems to realize the predicted capacity gains.
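As a concrete (textbook) illustration of subspace-based DoA estimation from uplink snapshots, here is a standard 1D MUSIC sketch for a uniform linear array with half-wavelength spacing; the paper addresses the 2D-array/3D-channel case with its own estimator, so this is background rather than the paper's method.

```python
import numpy as np

def music_doa(snapshots, n_sources, n_grid=360):
    """Textbook MUSIC DoA spectrum for a uniform linear array with
    half-wavelength spacing; snapshots is complex (n_antennas, n_samples)."""
    m = snapshots.shape[0]
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]   # sample covariance
    _, vecs = np.linalg.eigh(R)                               # ascending eigenvalues
    En = vecs[:, : m - n_sources]                             # noise subspace
    angles = np.linspace(-90, 90, n_grid)
    spectrum = []
    for theta in angles:
        a = np.exp(-1j * np.pi * np.arange(m) * np.sin(np.deg2rad(theta)))
        spectrum.append(1.0 / np.linalg.norm(En.conj().T @ a) ** 2)
    return angles, np.asarray(spectrum)                       # peaks ~ DoAs
```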

Joint Angle and Delay Estimation for 2D Active Broadband MIMO-OFDM Systems

Proceedings of the IEEE Global Communications Conference

Mobile data traffic is expected to grow exponentially in the future. In order to meet this challenge as well as the form factor limitation at the base station, 2D "Massive MIMO", combined with OFDM, has been proposed as one of the enabling technologies to significantly increase the spectral efficiency of a broadband wireless system. In 2D broadband MIMO-OFDM systems, a base station will rely on the spatial information extracted from uplink sounding reference signals to perform downlink MIMO beam-forming. Accordingly, multi-dimensional parameter estimation of a ray-based multipath wireless channel becomes crucial for such systems to realize the predicted capacity gains.
