Yoshua Bengio - Academia.edu
Papers by Yoshua Bengio
Journal of computational neuroscience, 2010
Dopaminergic neuron activity has been modeled during learning and appetitive behavior, most commonly using the temporal-difference (TD) algorithm. However, a proper representation of elapsed time and of the exact task is usually required for the model to work. Most models use timing elements such as delay-line representations of time that are not biologically realistic for intervals in the range of seconds. The interval-timing literature provides several alternatives. One of them is that timing could emerge from general network dynamics, instead of coming from a dedicated circuit. Here, we present a general rate-based learning model based on long short-term memory (LSTM) networks that learns a time representation when needed. Using a naïve network learning its environment in conjunction with TD, we reproduce dopamine activity in appetitive trace conditioning with a constant CS-US interval, including probe trials with unexpected delays. The proposed model learns a representation of t...
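The reward-prediction error at the heart of such TD accounts can be illustrated with a minimal tabular TD(0) sketch; the state encoding, learning rate, and discount factor below are illustrative assumptions, not the paper's LSTM-based model.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.98):
    """One tabular TD(0) step: delta is the reward-prediction error
    often compared to phasic dopamine responses."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

# Toy trace-conditioning episode: CS at step 0, US (reward) on the transition into state 5.
n_states = 7                      # one state per time step within a trial
V = np.zeros(n_states)
for trial in range(200):
    for t in range(n_states - 1):
        r = 1.0 if t == 4 else 0.0
        td0_update(V, t, r, t + 1)

print(V)  # values back up from the reward time toward CS onset over trials
```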
MIT Press, Dec 8, 2014
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
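For reference, the minimax game described here is conventionally written as the value function below, with G trained to minimize it and D to maximize it; at the optimum, G reproduces the data distribution and D outputs 1/2 everywhere, as stated above.

```latex
\[
\min_G \max_D \; V(D, G) \;=\;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
\]
```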
Unsupervised and Transfer Learning Workshop, in conjunction with ICML, 2011
Learning good representations from a large set of unlabeled data is a particularly challenging task. Recent work (see Bengio (2009) for a review) shows that training deep architectures is a good way to extract such representations, by extracting and disentangling gradually higher-level factors of variation characterizing the input distribution. In this paper, we describe different kinds of layers we trained for learning representations in the setting of the Unsupervised and Transfer Learning Challenge. The strategy of our team won the ...
Biological Cybernetics, Nov 1, 2013
Dopaminergic models based on the temporal-difference learning algorithm (TD) usually do not differentiate trace from delay conditioning. Instead, they use a fixed temporal representation of elapsed time since conditioned stimulus onset. Recently, a new model was proposed in which timing is learned within a long short-term memory (LSTM) artificial neural network representing the cerebral cortex (Rivest et al. 2010). In this paper, that model's ability to reproduce and explain relevant data, as well as its ability to make interesting new predictions, are evaluated. The model reveals a strikingly different temporal representation between trace and delay conditioning since trace conditioning requires working memory to remember the past conditioned stimulus while delay conditioning does not. On the other hand, the model predicts no important difference in DA responses between those two conditions when trained on one conditioning paradigm and tested on the other. The model predicts that i...
JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS, Apr 1, 2011
Recent theoretical and empirical work in statistical machine learning has demonstrated the potential of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple levels of representation. The hypothesis evaluated here is that intermediate levels of representation, because they can be shared across tasks and examples from different but related distributions, can yield even more benefits. Comparative experiments were performed on a large-scale handwritten character recognition setting ...
This paper introduces the Metric-Free Natural Gradient (MFNG) algorithm for training Boltzmann Machines. Similar in spirit to the Hessian-Free method of Martens [8], our algorithm belongs to the family of truncated Newton methods and exploits an efficient matrix-vector product to avoid explicitly storing the natural gradient metric L. This metric is shown to be the expected second derivative of the log-partition function (under the model distribution), or equivalently, the variance of the vector of partial derivatives of the energy function. We evaluate our method on the task of jointly training a 3-layer Deep Boltzmann Machine and show that MFNG does indeed have faster per-epoch convergence compared to Stochastic Maximum Likelihood with centering, though wall-clock performance is currently not competitive.
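Writing E_θ for the energy and Z(θ) for the partition function, and using the fact that a Boltzmann Machine's energy is linear in its parameters, the metric described above can be written as follows, with all expectations taken under the model distribution p_θ(x) ∝ exp(−E_θ(x)).

```latex
\[
L \;=\; \frac{\partial^2 \log Z(\theta)}{\partial \theta \,\partial \theta^\top}
\;=\; \operatorname{Cov}_{p_\theta}\!\left[\frac{\partial E_\theta(x)}{\partial \theta}\right]
\;=\; \mathbb{E}_{p_\theta}\!\left[\frac{\partial E_\theta}{\partial \theta}\,\frac{\partial E_\theta}{\partial \theta}^{\!\top}\right]
- \mathbb{E}_{p_\theta}\!\left[\frac{\partial E_\theta}{\partial \theta}\right]
  \mathbb{E}_{p_\theta}\!\left[\frac{\partial E_\theta}{\partial \theta}\right]^{\!\top}
\]
```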
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, the gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to building deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
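A much-simplified sketch of the gated-feedback idea, assuming plain tanh units and scalar gates computed from the current input and the concatenated previous hidden states; the exact parameterization in the paper may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_gate(x_t, h_prev_all, w_x, u_h):
    """Scalar gate for one (source layer -> target layer) pair, computed from
    the current input and the concatenated previous hidden states."""
    return sigmoid(w_x @ x_t + u_h @ h_prev_all)

def gated_feedback_input(x_t, h_prev_layers, W_in, U_fb, w_x_gates, u_h_gates):
    """Pre-activation of one target layer: the usual input term plus feedback
    from every layer, each scaled by its own global gate."""
    h_prev_all = np.concatenate(h_prev_layers)
    pre = W_in @ x_t
    for j, h_j in enumerate(h_prev_layers):
        g = global_gate(x_t, h_prev_all, w_x_gates[j], u_h_gates[j])
        pre += g * (U_fb[j] @ h_j)
    return np.tanh(pre)

rng = np.random.default_rng(0)
d_x, d_h, n_layers = 8, 16, 3
out = gated_feedback_input(
    rng.normal(size=d_x),
    [rng.normal(size=d_h) for _ in range(n_layers)],
    W_in=rng.normal(size=(d_h, d_x)),
    U_fb=[rng.normal(size=(d_h, d_h)) for _ in range(n_layers)],
    w_x_gates=[rng.normal(size=d_x) for _ in range(n_layers)],
    u_h_gates=[rng.normal(size=n_layers * d_h) for _ in range(n_layers)],
)
print(out.shape)  # (16,)
```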
Speech Communication, 1990
The vowel sub-component of a speaker-independent phoneme classification system will be described. The architecture of the vowel classifier is based on an ear model followed by a set of Multi-Layered Neural Networks (MLNN). MLNNs are trained to learn how to recognize articulatory features like the place of articulation and the manner of articulation related to tongue position.
IEEE Transactions on Neural Networks, 2001
We introduce an asset-allocation framework based on the active control of the value-at-risk of the portfolio. Within this framework, we compare two paradigms for making the allocation using neural networks. The first one uses the network to make a forecast of asset behavior, in conjunction with a traditional mean-variance allocator for constructing the portfolio. The second paradigm uses the network to directly make the portfolio allocation decisions. We consider a method for performing soft input variable selection, and show its considerable utility. We use model combination (committee) methods to systematize the choice of hyperparameters during training. We show that committees using both paradigms significantly outperform the benchmark market performance.
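Value-at-risk itself is simply a tail quantile of the portfolio return distribution; a minimal historical-simulation sketch is given below, where the 95% level and the simulated returns are illustrative assumptions rather than the forecasting models compared in the paper.

```python
import numpy as np

def historical_var(returns, weights, alpha=0.95):
    """Value-at-risk of a portfolio from historical asset returns.

    returns : (T, n_assets) array of per-period asset returns
    weights : (n_assets,) portfolio weights summing to 1
    Returns the loss threshold exceeded with probability 1 - alpha.
    """
    portfolio_returns = returns @ weights
    return -np.quantile(portfolio_returns, 1.0 - alpha)

# Example: 3 assets, 1000 periods of simulated returns.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=(1000, 3))
print(historical_var(returns, np.array([0.5, 0.3, 0.2])))
```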
arXiv preprint arXiv:1103.2832, Mar 15, 2011
This paper describes two applications of conditional restricted Boltzmann machines (CRBMs) to the task of autotagging music. The first consists of training a CRBM to predict tags that a user would apply to a clip of a song based on tags already applied by other users. By learning the relationships between tags, this model is able to pre-process training data to significantly improve the performance of a support vector machine (SVM) autotagger. The second is the use of a discriminative RBM, a type of CRBM, to autotag ...
Proceedings EUROSPEECH, 1991
In this paper we compare two hybrid acoustic-phonetic decoders based on Artificial Neural Networks (ANN). We evaluate them on the task of recognizing stop phones in continuous speech independently from the speaker. ANNs are well suited to perform detailed phonetic distinctions. In general, techniques based on Dynamic Programming (DP), in particular Hidden Markov Models (HMMs), have proven to be successful at modeling the temporal structure of the speech signal. In the approach described here, the ANN outputs ...
Université de Montréal, Technical Report, 1997
The problem of computing minimum redundancy codes as we observe symbols one by one has received a lot of attention. However, existing algorithms implicitly assume either that we have a small alphabet (quite typically 256 symbols) or that we have an arbitrary amount of memory at our disposal for the creation of the tree. In real-life applications one may need to encode symbols coming from a much larger alphabet, e.g., when coding integers. We now have to deal not with hundreds of symbols but possibly with ...
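For context, the static minimum-redundancy (Huffman) construction that such one-symbol-at-a-time adaptive variants build upon can be sketched with a heap; this is the textbook algorithm, not the large-alphabet technique of the report.

```python
import heapq

def huffman_codes(freqs):
    """Minimum-redundancy prefix codes from symbol frequencies.

    freqs : dict mapping symbol -> count
    Returns a dict mapping symbol -> bit string.
    """
    heap = [[count, i, {sym: ""}] for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # two least frequent subtrees
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```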
A binary arithmetic coder and decoder provide improved coding accuracy due to improved probability estimation and adaptation. They also provide improved decoding speed through a "fast path" design wherein decoding of a most probable symbol requires few computational steps. Coded data represents data that is populated by more probable symbols ("MPS") and less probable symbols ("LPS"). In an embodiment, a decoder receives a segment of the coded data as a binary fraction C. It defines a coding ...
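A didactic floating-point sketch of binary arithmetic coding with a fixed MPS probability is shown below; it illustrates only the interval-narrowing principle and is not the patent's adaptive, integer "fast path" implementation.

```python
def encode(bits, p_mps=0.8, mps=0):
    """Narrow [low, high) once per symbol; any value in the final interval
    identifies the whole sequence."""
    low, high = 0.0, 1.0
    for b in bits:
        mid = low + p_mps * (high - low)   # MPS gets the larger sub-interval
        if b == mps:
            high = mid
        else:
            low = mid
    return (low + high) / 2

def decode(code, n_bits, p_mps=0.8, mps=0):
    low, high = 0.0, 1.0
    out = []
    for _ in range(n_bits):
        mid = low + p_mps * (high - low)
        if code < mid:
            out.append(mps)
            high = mid
        else:
            out.append(1 - mps)
            low = mid
    return out

msg = [0, 0, 1, 0, 0, 0, 1, 0]
assert decode(encode(msg), len(msg)) == msg
```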
Proceedings of the 13th International Workshop on AI and Statistics, May 13, 2010
Alternating Gibbs sampling is the most common scheme used for sampling from Restricted Boltzmann Machines (RBM), a crucial component in deep architectures such as Deep Belief Networks. However, we find that it often does a very poor job of rendering the diversity of modes captured by the trained model. We suspect that this hinders the advantage that could in principle be brought by training algorithms relying on Gibbs sampling for uncovering spurious modes, such as the Persistent Contrastive Divergence ...
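Alternating (block) Gibbs sampling in a binary RBM simply ping-pongs between the two conditionals p(h|v) and p(v|h); a minimal numpy sketch follows, in which the weight shapes, biases, and chain length are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, b_vis, b_hid, n_steps, rng):
    """Alternate h ~ p(h|v) and v ~ p(v|h) in a binary RBM.

    W : (n_vis, n_hid) weights; b_vis, b_hid : biases.
    Returns the visible sample after n_steps full updates.
    """
    v = v0
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + b_hid)          # p(h_j = 1 | v)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + b_vis)        # p(v_i = 1 | h)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 4
W = rng.normal(0, 0.1, size=(n_vis, n_hid))
sample = gibbs_chain(rng.integers(0, 2, n_vis).astype(float),
                     W, np.zeros(n_vis), np.zeros(n_hid), n_steps=100, rng=rng)
print(sample)
```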
Proceedings of the 11th ISMIR (International Society for Music Information Retrieval) conference, Aug 11, 2010
MTurk data: we collected 5 clips from each of 185 random blog tracks (925 clips), each seen by 3 turkers and described with 18 tags on average, for a total of 2,500 (user, clip) pairs and 15,500 (user, clip, tag) triples. We paid $0.03–$0.05 per clip, a total of about $100, and rejected 11% of responses as spammy or incomplete.
Deep architectures have demonstrated state-of-the-art results in a variety of settings, especially with vision datasets. Beyond the model definitions and the quantitative analyses, there is a need for qualitative comparisons of the solutions learned by various deep architectures. The goal of this paper is to find good qualitative interpretations of high-level features represented by such models. To this end, we contrast and compare several techniques applied on Stacked Denoising Autoencoders and Deep Belief Networks, ...
JMLR, 2007
We propose an estimator for the conditional density p(Y|X) that can adapt for asymmetric heavy tails which might depend on X. Such estimators have important applications in finance and insurance. We draw from Extreme Value Theory the tools to build a hybrid unimodal density having a parameter controlling the heaviness of the upper tail. This hybrid is a Gaussian whose upper tail has been replaced by a generalized Pareto tail. We use this hybrid in a multi-modal mixture in order to obtain a nonparametric density estimator that can easily adapt to heavy-tailed data. To obtain a conditional density estimator, the parameters of the mixture estimator can be seen as functions of X, and these functions are learned. We show experimentally that this approach better models the conditional density in terms of likelihood than competing algorithms: conditional mixture models with other types of components and multivariate nonparametric models.
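One way to write such a conditional mixture of hybrid components, with a Gaussian body spliced to a generalized Pareto upper tail at a threshold u, is sketched below; the exact splicing constants and the parameterization of the functions of X are the paper's and are not reproduced here.

```latex
\[
p(y \mid x) \;=\; \sum_{i=1}^{m} w_i(x)\, h_i(y; x),
\qquad
h_i(y; x) \;\propto\;
\begin{cases}
\mathcal{N}\!\big(y;\ \mu_i(x),\, \sigma_i^2(x)\big), & y \le u_i(x),\\[4pt]
\Big(1 + \xi_i(x)\,\dfrac{y - u_i(x)}{\beta_i(x)}\Big)^{-1/\xi_i(x) - 1}, & y > u_i(x),
\end{cases}
\]
```

Here the mixture weights w_i(x) and the component parameters are learned as functions of x, and the proportionality hides the continuity and normalization constants of each hybrid component.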
Inspired by recent work in machine translation and object detection, we introduce an attention-based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
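The deterministic (soft) attention variant reduces to a softmax-weighted average of image annotation vectors; a minimal numpy sketch follows, in which the scoring function is an illustrative stand-in for the paper's exact parameterization.

```python
import numpy as np

def soft_attention(annotations, h_prev, W_a, W_h, v):
    """One soft-attention step for image captioning.

    annotations : (L, D) feature vectors, one per image location
    h_prev      : (H,) previous decoder hidden state
    Returns the context vector (D,) and the attention weights (L,).
    """
    scores = np.tanh(annotations @ W_a + h_prev @ W_h) @ v   # (L,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                      # softmax over locations
    context = alpha @ annotations                             # expected annotation vector
    return context, alpha

L_loc, D, H, K = 196, 512, 256, 64
rng = np.random.default_rng(0)
ctx, alpha = soft_attention(rng.normal(size=(L_loc, D)), rng.normal(size=H),
                            rng.normal(size=(D, K)), rng.normal(size=(H, K)),
                            rng.normal(size=K))
print(ctx.shape, alpha.shape)
```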