FiLM: Visual Reasoning with a General Conditioning Layer

Authors

Ethan Perez
Florian Strub (Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 CRIStAL, France)
Harm de Vries (MILA, Université de Montréal)
Vincent Dumoulin (MILA, Université de Montréal)
Aaron Courville (MILA, Université de Montréal, CIFAR Fellow)

DOI:

https://doi.org/10.1609/aaai.v32i1.11671

Keywords:

Deep Learning, Language and Vision

Abstract

We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning (answering image-related questions that require a multi-step, high-level process), a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
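
To make the "feature-wise affine transformation" concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each feature map (channel) is scaled by one coefficient and shifted by another, and that both are computed from the conditioning input. The names `film`, `film_parameters`, `gamma`, `beta`, `weight`, and `bias` are illustrative assumptions, and the single linear map that produces the coefficients is a hypothetical stand-in for whatever network encodes the conditioning information.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # one channel, stored as height x width


def film(features: List[FeatureMap],
         gamma: List[float],
         beta: List[float]) -> List[FeatureMap]:
    """Scale and shift each channel: out[c] = gamma[c] * features[c] + beta[c]."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one (gamma, beta) pair per channel")
    return [
        [[g * value + b for value in row] for row in fmap]
        for fmap, g, b in zip(features, gamma, beta)
    ]


def film_parameters(conditioning: List[float],
                    weight: List[List[float]],
                    bias: List[float]) -> Tuple[List[float], List[float]]:
    """Hypothetical generator: a linear map from the conditioning vector
    to 2 * channels numbers, split into (gamma, beta)."""
    params = [
        sum(w * c for w, c in zip(row, conditioning)) + b
        for row, b in zip(weight, bias)
    ]
    half = len(params) // 2
    return params[:half], params[half:]


# Toy input: 2 channels of size 2 x 2, and a 2-dimensional conditioning vector
# (for example, an embedding of a question about the image).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, -1.0], [1.0, 0.5]],
]
conditioning = [1.0, 2.0]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, -0.5]]  # assumed values
bias = [0.0, 0.0, 0.0, 0.0]

gamma, beta = film_parameters(conditioning, weight, bias)
modulated = film(features, gamma, beta)
```

In this sketch the conditioning input changes only the per-channel scale and shift, so the same feature extractor can be steered differently for each conditioning input without altering its own weights.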

How to Cite

Perez, E., Strub, F., de Vries, H., Dumoulin, V., & Courville, A. (2018). FiLM: Visual Reasoning with a General Conditioning Layer. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11671

Issue

Section

AAAI Technical Track: Machine Learning