Thomas R Manzini | Carnegie Mellon University

Papers by Thomas R Manzini

Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities

AAAI 2019

Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed from the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input, and as a result, the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence to sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn robust joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information from all modalities. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for final sentiment prediction. This ensures that our model remains robust to perturbations or missing information in the other modalities. We train our model with a coupled translation-prediction objective, and it achieves new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.
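
At its core, the coupled translation-prediction objective described above combines a forward translation loss, a cycle-consistency (back-translation) loss, and a sentiment loss computed from the joint representation. The PyTorch sketch below shows one way such an objective could be wired up; the module sizes, feature dimensions, and loss weights are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityTranslator(nn.Module):
    """GRU encoder-decoder mapping a source-modality sequence to a
    target-modality sequence. Feature sizes here are placeholders."""
    def __init__(self, src_dim, tgt_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(tgt_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_dim)

    def forward(self, src, tgt):
        _, h = self.encoder(src)            # h acts as the joint representation
        dec_out, _ = self.decoder(tgt, h)   # teacher-forced decoding during training
        return self.out(dec_out), h.squeeze(0)

# Hypothetical dimensions: language features (300-d), acoustic features (74-d).
text_to_audio = ModalityTranslator(src_dim=300, tgt_dim=74)
audio_to_text = ModalityTranslator(src_dim=74, tgt_dim=300)
sentiment_head = nn.Linear(128, 1)
mse = nn.MSELoss()

def coupled_loss(text, audio, label, lambda_cyc=1.0, lambda_pred=1.0):
    # Forward translation: language -> acoustic.
    audio_hat, joint_rep = text_to_audio(text, audio)
    trans_loss = mse(audio_hat, audio)
    # Cycle: translate the predicted acoustic stream back to language.
    text_hat, _ = audio_to_text(audio_hat, text)
    cycle_loss = mse(text_hat, text)
    # Sentiment prediction from the joint representation alone.
    pred = sentiment_head(joint_rep).squeeze(-1)
    pred_loss = mse(pred, label)
    return trans_loss + lambda_cyc * cycle_loss + lambda_pred * pred_loss
```

Teacher forcing is only used during training; at test time the encoder would consume the source-modality sequence alone, which is what allows prediction when the other modalities are missing or perturbed.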

Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings

NAACL-HLT 2019

Online texts—across genres, registers, domains, and styles—are riddled with human stereotypes, expressed in overt or subtle ways. Word embeddings, trained on these texts, perpetuate and amplify these stereotypes, and propagate biases to machine learning models that use word embeddings as features. In this work, we propose a method to debias word embeddings in multiclass settings such as race and religion, extending the work of Bolukbasi et al. (2016) from binary settings such as binary gender. Next, we propose a novel methodology for the evaluation of multiclass debiasing. We demonstrate that our multiclass debiasing is robust and maintains efficacy in standard NLP tasks.
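
The binary hard-debiasing recipe of Bolukbasi et al. (2016) extends naturally to the multiclass case by estimating a bias subspace from several sets of group-defining words (for example, one set per religion) and projecting that subspace out of other word vectors. The NumPy sketch below outlines that general recipe with placeholder word lists; it is not a reproduction of the paper's exact procedure or evaluation.

```python
import numpy as np

def bias_subspace(embeddings, defining_sets, k=2):
    """Estimate a k-dimensional bias subspace from sets of group-defining
    words. Each set is mean-centered and the principal directions of the
    pooled, centered vectors are returned."""
    centered = []
    for words in defining_sets:
        vecs = np.stack([embeddings[w] for w in words])
        centered.append(vecs - vecs.mean(axis=0, keepdims=True))
    centered = np.concatenate(centered, axis=0)
    # Rows of vt are the principal directions of the bias subspace.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def hard_debias(vec, basis):
    """Remove the component of `vec` lying in the bias subspace,
    then re-normalize to unit length."""
    proj = sum(np.dot(vec, b) * b for b in basis)
    debiased = vec - proj
    return debiased / np.linalg.norm(debiased)

# Hypothetical usage with placeholder defining sets and a dict-like
# `embeddings` mapping word -> unit-normalized vector:
# basis = bias_subspace(embeddings, [["jew", "christian", "muslim"], ...], k=2)
# embeddings["violent"] = hard_debias(embeddings["violent"], basis)
```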

Towards Improving Intelligibility of Black-Box Speech Synthesizers in Noise

This paper explores how different synthetic speech systems can be understood in a noisy environment that resembles radio noise. This work is motivated by a need for intelligible speech in noisy environments such as emergency response and disaster notification. We discuss prior work on listening tasks as well as speech in noise. We analyze three different speech synthesizers in three different noise settings. We quantitatively measure the intelligibility of each synthesizer in each noise setting based on human performance on a listening task. Finally, treating the synthesizer and its generated audio as a black box, we present how word-level and sentence-level input choices can lead to increased or decreased listener error rates for synthesized speech.
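
A common way to turn listening-task responses into a quantitative intelligibility score is a word error rate between the prompted sentence and the listener's transcription. The sketch below computes that metric; the abstract does not state which error measure the study used, so treat this as an illustrative assumption rather than the paper's protocol.

```python
def word_error_rate(reference: str, transcription: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    between the prompt and the listener's response, divided by the
    number of reference words."""
    ref = reference.lower().split()
    hyp = transcription.lower().split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("send help to main street", "send kelp to main street") -> 0.2
```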

Language Informed Modeling of Code-Switched Text

We approach code-switching through language modeling (LM) on a corpus of Hinglish (Hindi + English) that we collected from blogging websites, containing 59,189 unique sentences. We implement and discuss different language models derived from a multi-layered LSTM architecture. Our main hypothesis is that explicitly providing language ID information builds a more robust language model than simple word-level models, by learning the switching points. We attempt this in two ways: (1) a factored model that learns embeddings for both the input word and the input language, and (2) multi-task learning that predicts the language of the next word along with the word itself. We show that our highest-performing model achieves a test perplexity of 19.52 on the code-switched corpus that we collected and processed. On this data we demonstrate that our performance is an improvement over the AWD-LSTM LM (a recent state of the art on monolingual English).
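
As a rough sketch of the two variants above, a factored input layer concatenates a word embedding with a language-ID embedding, while a multi-task head additionally predicts the language of the next token. The PyTorch module below combines both ideas; the layer sizes, vocabulary handling, and loss weighting are assumptions, and the paper's exact architecture beyond a multi-layered LSTM may differ.

```python
import torch
import torch.nn as nn

class FactoredCodeSwitchLM(nn.Module):
    """Language model whose input at each step is the concatenation of a
    word embedding and a language-ID embedding (e.g. Hindi vs. English).
    Dimensions and vocabulary sizes are placeholders."""
    def __init__(self, vocab_size, num_langs=2,
                 word_dim=300, lang_dim=16, hidden_dim=512, num_layers=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.lang_emb = nn.Embedding(num_langs, lang_dim)
        self.lstm = nn.LSTM(word_dim + lang_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.next_word = nn.Linear(hidden_dim, vocab_size)
        # Multi-task head: also predict the language of the next token.
        self.next_lang = nn.Linear(hidden_dim, num_langs)

    def forward(self, words, langs):
        x = torch.cat([self.word_emb(words), self.lang_emb(langs)], dim=-1)
        h, _ = self.lstm(x)
        return self.next_word(h), self.next_lang(h)

# Training would sum a cross-entropy loss over next-word logits with a
# (typically down-weighted) cross-entropy loss over next-language logits.
```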

Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis

Multimodal machine learning is a core research area spanning the language, visual, and acoustic modalities. The central challenge in multimodal learning involves learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence to sequence (Seq2Seq) methods: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model. We also explore multiple variations on the multimodal inputs and outputs of these Seq2Seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines and achieve improved performance on multimodal sentiment analysis, specifically in the bimodal case, where our model improves F1 score by twelve points. We also discuss future directions for multimodal Seq2Seq methods.
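
One way to read the Seq2Seq modality-translation idea is as a two-stage pipeline: train an encoder-decoder to translate one modality into another without sentiment labels, then reuse the frozen encoder state as the joint representation for a supervised sentiment predictor. The sketch below illustrates that pipeline with placeholder dimensions; the hierarchical variant and the paper's actual hyperparameters are not reproduced here.

```python
import torch
import torch.nn as nn

# Two-stage sketch: (1) unsupervised modality translation, (2) a sentiment
# head trained on the frozen encoder's representation. Sizes are placeholders.
encoder = nn.GRU(input_size=300, hidden_size=128, batch_first=True)  # language features in
decoder = nn.GRU(input_size=74, hidden_size=128, batch_first=True)   # acoustic features out
proj = nn.Linear(128, 74)
mse = nn.MSELoss()

def translation_step(text_seq, audio_seq):
    """Stage 1: unsupervised reconstruction of the target modality."""
    _, h = encoder(text_seq)
    out, _ = decoder(audio_seq, h)          # teacher forcing during training
    return mse(proj(out), audio_seq)

class SentimentHead(nn.Module):
    """Stage 2: regress sentiment from the frozen encoder's final state."""
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, text_seq):
        with torch.no_grad():               # encoder stays frozen
            _, h = encoder(text_seq)
        return self.fc(h.squeeze(0))
```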

Building CMU Magnus from User Feedback

Recent years have seen a surge in consumer usage of spoken dialog systems, due to the popularity of voice assistants. While these systems are capable of answering factual questions or executing basic tasks, they do not yet have the capability to hold multi-turn conversations. The Alexa Prize challenge provides us with a great opportunity to explore various approaches and dialog strategies for building a multi-turn conversational agent. In this report we identify key challenges in building a social conversational dialog system and present CMU Magnus, an intelligent interactive spoken dialog system that can hold conversations over a range of topics. The system learns and updates itself over time, and can handle argumentative or subjective conversations.

How Would You Say It? Eliciting Lexically Diverse Data for Supervised Semantic Parsing

Building dialogue interfaces for real-world scenarios often entails training semantic parsers starting from zero examples. How can we build datasets that better capture the variety of ways users might phrase their queries, and which queries are actually realistic? Wang et al. (2015) proposed a method to build semantic parsing datasets by generating canonical utterances using a grammar and having crowdworkers paraphrase them into natural wording. A limitation of this approach is that it biases workers toward using language similar to the canonical utterances. In this work, we present a methodology that elicits meaningful and lexically diverse queries from users for semantic parsing tasks. Starting from a seed lexicon and a generative grammar, we pair logical forms with mixed text-image representations and ask crowdworkers to paraphrase and confirm the plausibility of the queries they generated. We use this method to build a semantic parsing dataset from scratch for a dialog agent in a smart-home simulation. We find evidence that this dataset, which we have named SMARTHOME, is demonstrably more lexically diverse and more difficult to parse than existing domain-specific semantic parsing datasets.
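
The elicitation pipeline starts, like Wang et al. (2015), from a seed lexicon and a generative grammar that pairs logical forms with prompts for crowdworkers. The toy sketch below shows that pairing step for a hypothetical smart-home domain; the devices, predicates, and templates are invented for illustration, and the paper's key addition of mixed text-image prompts is not reproduced here.

```python
import itertools

# Invented seed lexicon and action templates for a toy smart-home domain.
DEVICES = {"lamp": "light.living_room", "thermostat": "climate.hall"}
ACTIONS = {"turn on": "set_power(%s, on)", "turn off": "set_power(%s, off)"}

def generate_pairs():
    """Yield (canonical_utterance, logical_form) pairs for crowdworkers
    to paraphrase and to judge for plausibility."""
    for (phrase, device_id), (verb, lf_template) in itertools.product(
            DEVICES.items(), ACTIONS.items()):
        canonical = f"{verb} the {phrase}"
        yield canonical, lf_template % device_id

if __name__ == "__main__":
    for utterance, lf in generate_pairs():
        print(f"{utterance!r:30} -> {lf}")
```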

A Play on Words: Using Cognitive Computing as a Basis for AI Solvers in Word Puzzles

In this paper we offer a model, drawing inspiration from human cognition and based upon the pipeline developed for IBM's Watson, which solves clues in a type of word puzzle called syllacrostics. We briefly discuss how it is situated with respect to the broader field of artificial general intelligence (AGI) and how this process and model might be applied to other types of word puzzles. We present an overview of a system that has been developed to solve syllacrostics.
