Farley Lai - Academia.edu
Papers by Farley Lai
In this paper, we introduce a contextual grounding approach that captures the context in corresponding text entities and image regions to improve the grounding accuracy. Specifically, the proposed architecture accepts pre-trained text token embeddings and image object features from an off-the-shelf object detector as input. Additional encoding to capture the positional and spatial information can be added to enhance the feature quality. There are separate text and image branches facilitating respective architectural refinements for different modalities. The text branch is pre-trained on a large-scale masked language modeling task while the image branch is trained from scratch. Next, the model learns the contextual representations of the text tokens and image objects through layers of high-order interaction respectively. The final grounding head ranks the correspondence between the textual and visual representations through cross-modal interaction. In the evaluation, we show that our...
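The grounding head described above can be pictured as a ranking over (entity, region) pairs. The sketch below is a minimal illustration under assumed shapes and a cosine-similarity score, not the paper's actual architecture:

```python
import numpy as np

def rank_regions(text_repr, region_repr):
    """Score every (entity, region) pair and rank regions per entity.

    text_repr:   (T, D) contextual embeddings of text entities
    region_repr: (R, D) contextual embeddings of image regions
    Returns a (T, R) array of region indices, best match first.
    """
    # Cosine similarity is an illustrative cross-modal score, not the
    # paper's learned interaction.
    t = text_repr / np.linalg.norm(text_repr, axis=1, keepdims=True)
    r = region_repr / np.linalg.norm(region_repr, axis=1, keepdims=True)
    scores = t @ r.T                      # (T, R) similarity matrix
    return np.argsort(-scores, axis=1)    # rank regions for each entity

# Toy example: entity 0 should match region 1 exactly, and vice versa.
text = np.array([[0.0, 1.0], [1.0, 0.0]])
regions = np.array([[1.0, 0.0], [0.0, 1.0]])
ranking = rank_regions(text, regions)
```

In the actual model, the scores would come from learned cross-modal interaction layers rather than raw cosine similarity, but the final ranking step is the same shape of computation.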
Pose tracking is an important problem that requires identifying unique human pose-instances and matching them temporally across different frames of a video. However, existing pose tracking methods are unable to accurately model temporal relationships and require significant computation, often computing the tracks offline. We present an efficient Multi-person Pose Tracking method, KeyTrack, that only relies on keypoint information without using any RGB or optical flow information to track human keypoints in real-time. Keypoints are tracked using our Pose Entailment method, in which, first, a pair of pose estimates is sampled from different frames in a video and tokenized. Then, a Transformer-based network makes a binary classification as to whether one pose temporally follows another. Furthermore, we improve our top-down pose estimation method with a novel, parameter-free, keypoint refinement technique that improves the keypoint estimates used during the Pose Entailment step. We achi...
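The tokenization step in Pose Entailment can be illustrated with a simple grid quantization: each keypoint's (x, y) coordinate is mapped to a discrete cell id so a pair of poses becomes a token sequence the Transformer can consume. The grid scheme and bin count below are assumptions for illustration, not KeyTrack's published tokenizer:

```python
def tokenize_pose(keypoints, img_w, img_h, bins=32):
    """Map each (x, y) keypoint to a token id on a bins x bins grid.

    Grid quantization is one plausible tokenization; KeyTrack's actual
    scheme may differ.
    """
    tokens = []
    for x, y in keypoints:
        col = min(int(x / img_w * bins), bins - 1)
        row = min(int(y / img_h * bins), bins - 1)
        tokens.append(row * bins + col)   # one id per grid cell
    return tokens

# Top-left and bottom-right corners map to the first and last cells.
pose = [(0.0, 0.0), (639.0, 479.0)]
ids = tokenize_pose(pose, img_w=640, img_h=480)
```

Two such token sequences (one per frame) would then be concatenated and classified as "follows" / "does not follow" by the Transformer.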
AudioSense integrates mobile phones and web technology to measure hearing aid performance in real-time and in-situ. Measuring the performance of hearing aids in the real world poses significant challenges as it depends on the patient's listening context. AudioSense uses Ecological Momentary Assessment methods to evaluate both the perceived hearing aid performance as well as to characterize the listening environment using electronic surveys. AudioSense further characterizes a patient's listening context by recording their GPS location and sound samples. By creating a time-synchronized record of listening performance and listening contexts, AudioSense will allow researchers to understand the relationship between listening context and hearing aid performance. Performance evaluation shows that AudioSense is reliable, energy-efficient, and can estimate Signal-to-Noise Ratio (SNR) levels from captured audio samples.
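The SNR estimation mentioned above can be sketched with the standard power-ratio definition, SNR(dB) = 10·log10(P_signal / P_noise). This is a textbook estimator for illustration; the abstract does not specify AudioSense's actual estimation method:

```python
import math

def estimate_snr_db(signal, noise):
    """Estimate SNR in dB from a signal segment and a noise-only segment.

    Standard power-ratio estimate; AudioSense's exact estimator is not
    given in the abstract.
    """
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

# A signal with 10x the noise power yields 10 dB.
snr = estimate_snr_db([math.sqrt(10.0)] * 100, [1.0] * 100)
```

In practice the noise-only segment would be taken from pauses in the captured audio, which is the harder part of the estimation problem.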
Pose tracking is an important problem that requires identifying unique human pose-instances and matching them temporally across different frames of a video. However, existing pose tracking methods are unable to accurately model temporal relationships and require significant computation, often computing the tracks offline. We present an efficient multi-person pose tracking method, KeyTrack, that only relies on keypoint information without using any RGB or optical flow information to track human keypoints in real-time. Keypoints are tracked using our Pose Entailment method, in which, first, a pair of pose estimates is sampled from different frames in a video and tokenized. Then, a Transformer-based network makes a binary classification as to whether one pose temporally follows another. Furthermore, we improve our top-down pose estimation method with a novel, parameter-free, keypoint refinement technique that improves the keypoint estimates used during the Pose Entailment step. We achi...
ArXiv, 2018
We introduce a new inference task - Visual Entailment (VE) - which differs from traditional Textual Entailment (TE) tasks whereby a premise is defined by an image, rather than a natural language sentence as in TE tasks. A novel dataset SNLI-VE (publicly available at this https URL) is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA based models perform.
ArXiv, 2019
Existing visual reasoning datasets such as Visual Question Answering (VQA), often suffer from biases conditioned on the question, image or answer distributions. The recently proposed CLEVR dataset addresses these limitations and requires fine-grained reasoning but the dataset is synthetic and consists of similar objects and sentence structures across the dataset. In this paper, we introduce a new inference task, Visual Entailment (VE) - consisting of image-sentence pairs whereby a premise is defined by an image, rather than a natural language sentence as in traditional Textual Entailment tasks. The goal of a trained VE model is to predict whether the image semantically entails the text. To realize this task, we build a dataset SNLI-VE based on the Stanford Natural Language Inference corpus and Flickr30k dataset. We evaluate various existing VQA baselines and build a model called Explainable Visual Entailment (EVE) system to address the VE task. EVE achieves up to 71% accuracy and ou...
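Because SNLI premises are Flickr30k captions, the SNLI-VE construction described above can be sketched as a join that swaps each text premise for the image its caption describes. The field names and the drop-on-miss policy below are illustrative assumptions, not the released dataset's schema:

```python
def build_snli_ve(snli_pairs, caption_to_image):
    """Replace each SNLI text premise with its source Flickr30k image id.

    snli_pairs: iterable of (premise_caption, hypothesis, label) triples
    caption_to_image: maps a caption back to the image it describes
    Pairs whose caption cannot be traced to an image are dropped here,
    which is a simplifying assumption.
    """
    dataset = []
    for premise, hypothesis, label in snli_pairs:
        image_id = caption_to_image.get(premise)
        if image_id is not None:
            dataset.append({"image": image_id,
                            "hypothesis": hypothesis,
                            "label": label})
    return dataset

pairs = [("a dog runs", "an animal moves", "entailment"),
         ("caption with no known image", "x", "neutral")]
ve = build_snli_ve(pairs, {"a dog runs": "img_123"})
```

The entailment / neutral / contradiction labels carry over unchanged, since the hypothesis and its relation to the scene are the same whether the premise is stated as text or shown as an image.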
ArXiv, 2019
In this paper, we introduce a contextual grounding approach that captures the context in corresponding text entities and image regions to improve the grounding accuracy. Specifically, the proposed architecture accepts pre-trained text token embeddings and image object features from an off-the-shelf object detector as input. Additional encoding to capture the positional and spatial information can be added to enhance the feature quality. There are separate text and image branches facilitating respective architectural refinements for different modalities. The text branch is pre-trained on a large-scale masked language modeling task while the image branch is trained from scratch. Next, the model learns the contextual representations of the text tokens and image objects through layers of high-order interaction respectively. The final grounding head ranks the correspondence between the textual and visual representations through cross-modal interaction. In the evaluation, we show that our...
2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI), 2018
Energy-efficiency is a key concern in mobile sensing applications, such as those for tracking social interactions or physical activities. An attractive approach to saving energy is to shape the workload of the system by artificially introducing delays so that the workload would require less energy to process. However, adding delays to save energy may have a detrimental impact on user experience. To address this problem, we present Gratis, a novel paradigm for incorporating workload shaping energy optimizations in mobile sensing applications in an automated manner. Gratis adopts stream programs as a high-level abstraction whose execution is coordinated using an explicit power management policy. We present an expressive coordination language that can specify a broad range of workload-shaping optimizations. A unique property of the proposed power management policies is that they have predictable performance, which can be estimated at compile time, in a computationally efficient manner,...
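The energy-versus-delay trade-off behind workload shaping can be made concrete with a toy cost model: each processing episode pays a fixed wake-up cost, so delaying events into batches amortizes that overhead. The model and cost numbers below are illustrative assumptions, not Gratis's compile-time estimator:

```python
def energy(num_events, per_event_cost, wake_cost, batch_size):
    """Total energy under a toy wake-up model.

    Each batch incurs one fixed wake-up cost plus per-event work, so
    larger batches amortize the wake-up overhead at the price of added
    delay. (Illustrative model, not Gratis's actual estimator.)
    """
    batches = -(-num_events // batch_size)      # ceiling division
    return batches * wake_cost + num_events * per_event_cost

immediate = energy(100, per_event_cost=1.0, wake_cost=5.0, batch_size=1)
batched = energy(100, per_event_cost=1.0, wake_cost=5.0, batch_size=10)
```

Under this model, batching by 10 cuts the wake-up overhead by a factor of 10 while the per-event work is unchanged, which is the basic reason delay-tolerant workloads can be processed much more cheaply.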
This paper considers the problem of spatiotemporal object-centric reasoning in videos. Central to our approach is the notion of object permanence, i.e., the ability to reason about the location of objects as they move through the video while being occluded, contained or carried by other objects. Existing deep learning based approaches often suffer from spatiotemporal biases when applied to video reasoning problems. We propose Hopper, which uses a Multi-hop Transformer for reasoning about object permanence in videos. Given a video and a localization query, Hopper reasons over image and object tracks to automatically hop over critical frames in an iterative fashion to predict the final position of the object of interest. We demonstrate the effectiveness of using a contrastive loss to reduce spatiotemporal biases. We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy using just 1 FPS by hopping through just a few critical frames. We also demonstrate Hopper can per...
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
[Thesis figure captions] 5.6 Simulation of three domains. Domains 1 and 2 use the CPU and have delays of 10 and 20, respectively. Domain 3 uses the network. Domains 1-2 and domain 3 execute in parallel since they use different hardware resources. Domains 1 and 2 share the CPU fairly. 5.7 The energy-delay trade-off for SI and AR when using static sensing. Batching significantly improves energy efficiency. Combining batching with scheduled concurrency provides no additional benefit.
2015 International Conference on Embedded Software (EMSOFT), 2015
AudioSense integrates mobile phones and web technology to measure hearing aid performance in real-time and in-situ. Measuring the performance of hearing aids in the real world poses significant challenges as it depends on the patient's listening context. AudioSense uses Ecological Momentary Assessment methods to evaluate both the perceived hearing aid performance as well as to characterize the listening environment using electronic surveys. AudioSense further characterizes a patient's listening context by recording their GPS location and sound samples. By creating a time-synchronized record of listening performance and listening contexts, AudioSense will allow researchers to understand the relationship between listening context and hearing aid performance. Performance evaluation shows that AudioSense is reliable, energy-efficient, and can estimate Signal-to-Noise Ratio (SNR) levels from captured audio samples.
IPSN-14 Proceedings of the 13th International Symposium on Information Processing in Sensor Networks, 2014
This paper presents CSense - a stream-processing toolkit for developing robust and high-rate mobile sensing applications in Java. CSense addresses the needs of these systems by providing a new programming model that supports flexible application configuration, a high-level concurrency model, memory management, and compiler analyses and optimizations. Our compiler includes a novel flow analysis that optimizes the exchange of data across components from an application-wide perspective. A mobile sensing application benchmark indicates that flow analysis may reduce CPU utilization by as much as 45%. Static analysis is used to detect a range of programming errors including application composition errors, improper use of memory management, and data races. We identify that memory management and concurrency limit the scalability of stream processing systems. We incorporate memory pools, frame conversion optimizations, and custom synchronization primitives to develop a scalable run-time. CSense is evaluated on Galaxy Nexus phones running Android. Empirical results indicate that our run-time achieves 19 times higher stream processing rate compared to a realistic baseline implementation. We demonstrate the versatility of CSense by developing three mobile sensing applications.
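The memory pools mentioned above follow a standard pattern: preallocate a fixed set of frame buffers and recycle them instead of allocating per sample, which avoids garbage-collection pressure at high stream rates. The sketch below shows the idea in minimal form; it is illustrative and not CSense's actual Java API:

```python
class FramePool:
    """A minimal fixed-size buffer pool in the spirit of CSense's memory
    pools (illustrative; not the toolkit's actual interface)."""

    def __init__(self, frame_size, count):
        # Preallocate all frames up front; steady-state runs allocation-free.
        self._free = [bytearray(frame_size) for _ in range(count)]

    def acquire(self):
        # Reuse a recycled frame instead of allocating a new one.
        return self._free.pop() if self._free else None

    def release(self, frame):
        self._free.append(frame)

pool = FramePool(frame_size=4096, count=2)
a = pool.acquire()
b = pool.acquire()
exhausted = pool.acquire()     # pool is empty at this point
pool.release(a)
c = pool.acquire()             # the released frame is reused
```

Returning `None` on exhaustion is one possible policy; a real run-time might instead block the producer or drop the frame, which is exactly the kind of decision a stream-processing compiler can make per pipeline.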