Yani Ioannou | University of Calgary
Papers by Yani Ioannou
Although many real-world applications, such as disease prediction and fault detection, suffer from class imbalance, most existing graph-based classification methods ignore the skewness of the class distribution and therefore tend to be biased towards the majority class(es). Conventional methods typically tackle this problem by assigning weights to class samples based on a function of their loss, which can lead to over-fitting on outliers. In this paper, we propose a meta-learning algorithm, named Meta-GCN, that adaptively learns example weights by using a small unbiased meta-dataset to simultaneously minimize the meta-dataset loss and optimize the model weights. Our experiments show that Meta-GCN outperforms state-of-the-art frameworks and other baselines in terms of accuracy, area under the receiver operating characteristic curve (AUC-ROC), and macro F1-score for classification tasks on two different datasets.
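Below is a minimal, self-contained sketch (in PyTorch) of the per-example reweighting idea the abstract describes: take a virtual gradient step under candidate example weights, measure the loss of the resulting model on a small unbiased meta-dataset, and set the example weights from the negative meta-gradient. The plain linear classifier, the clamping/normalization details, and all tensor shapes are illustrative assumptions, not the exact Meta-GCN procedure (which operates on a graph convolutional network).

```python
# Sketch of meta-learned example reweighting in the spirit of Meta-GCN.
# The GCN is replaced by a plain linear classifier for brevity.
import torch
import torch.nn.functional as F

def reweighted_step(W, b, x, y, x_meta, y_meta, lr=0.1):
    """One training step with meta-learned per-example weights."""
    # 1) Per-example losses under the current parameters.
    eps = torch.zeros(x.size(0), requires_grad=True)      # candidate example weights
    losses = F.cross_entropy(x @ W + b, y, reduction="none")
    weighted_loss = (eps * losses).sum()

    # 2) Virtual SGD step on the weighted loss (keep the graph for meta-grads).
    gW, gb = torch.autograd.grad(weighted_loss, (W, b), create_graph=True)
    W_hat, b_hat = W - lr * gW, b - lr * gb

    # 3) Loss of the virtual model on the small unbiased meta-dataset.
    meta_loss = F.cross_entropy(x_meta @ W_hat + b_hat, y_meta)

    # 4) Example weights ~ negative meta-gradient, clamped and normalized.
    g_eps, = torch.autograd.grad(meta_loss, eps)
    w = torch.clamp(-g_eps, min=0.0)
    w = w / w.sum() if w.sum() > 0 else w

    # 5) Real update of the model with the learned example weights.
    loss = (w.detach() * F.cross_entropy(x @ W + b, y, reduction="none")).sum()
    gW, gb = torch.autograd.grad(loss, (W, b))
    with torch.no_grad():
        W -= lr * gW
        b -= lr * gb
    return loss.item()

# Hypothetical usage with made-up shapes:
W = torch.zeros(16, 3, requires_grad=True)
b = torch.zeros(3, requires_grad=True)
x, y = torch.randn(32, 16), torch.randint(0, 3, (32,))
x_meta, y_meta = torch.randn(8, 16), torch.randint(0, 3, (8,))
reweighted_step(W, b, x, y, x_meta, y_meta)
```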
VizieR Online Data Catalog, Apr 1, 2019
The Astrophysical Journal, Dec 5, 2018
Space-based missions such as Kepler, and soon TESS, provide large datasets that must be analyzed efficiently and systematically. Recent work by Shallue & Vanderburg (2018) successfully used state-of-the-art deep learning models to automatically classify Kepler transit signals as either exoplanets or false positives; our application of their model yielded 95.8% accuracy and 95.5% average precision. Here we expand upon that work by including additional scientific domain knowledge in the network architecture and input representations to significantly increase overall model performance to 97.5% accuracy and 98.0% average precision. Notably, we achieve 15-20% gains in recall for the lowest signal-to-noise transits that can correspond to rocky planets in the habitable zone. We input into the network centroid time-series information derived from Kepler data plus key stellar parameters taken from the Kepler DR25 and Gaia DR2 catalogues. We also implement data augmentation techniques to alleviate model over-fitting. These improvements allow us to drastically reduce the size of the model while still maintaining improved performance; smaller models also generalize better, for example from Kepler to TESS data. This work illustrates the importance of including expert domain knowledge in even state-of-the-art deep learning models when applying them to scientific research problems that seek to identify weak signals in noisy data. This classification tool will be especially useful for upcoming space-based photometry missions focused on finding small planets, such as TESS and PLATO.
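As an illustration of how such domain knowledge (centroid time series, stellar parameters) can be fed into a transit classifier, here is a rough PyTorch sketch of a multi-branch network: 1D convolutional branches over flux/centroid time series, with scalar stellar parameters concatenated before the fully connected head. The layer sizes, branch structure, and input lengths are assumptions for illustration, not the published architecture.

```python
# Illustrative multi-input transit classifier (not the paper's exact model).
import torch
import torch.nn as nn

class TransitClassifier(nn.Module):
    def __init__(self, n_stellar: int = 6):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv1d(in_ch, 16, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool1d(4),
                nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(8), nn.Flatten(),
            )
        # Each branch sees the flux and centroid time series as two channels.
        self.global_view = branch(2)
        self.local_view = branch(2)
        self.head = nn.Sequential(
            nn.Linear(32 * 8 * 2 + n_stellar, 128), nn.ReLU(),
            nn.Linear(128, 1),                      # P(planet) logit
        )

    def forward(self, global_ts, local_ts, stellar):
        feats = torch.cat([self.global_view(global_ts),
                           self.local_view(local_ts), stellar], dim=1)
        return self.head(feats)

# Hypothetical input lengths (2001-point global view, 201-point local view).
model = TransitClassifier()
logit = model(torch.randn(4, 2, 2001), torch.randn(4, 2, 201), torch.randn(4, 6))
```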
Computer Vision and Pattern Recognition, Mar 1, 2017
International Conference on Learning Representations, May 1, 2016
We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that are discriminative for image classification. To train such networks, a novel weight initialization scheme is used, allowing effective initialization of connection weights in convolutional layers composed of groups of differently-shaped filters. We validate our approach by applying it to several existing CNN architectures and training these networks from scratch on the CIFAR, ILSVRC and MIT Places datasets. Our results show similar or higher accuracy than conventional CNNs with much less compute. Applying our method to an improved version of the VGG-11 network using global max-pooling, we achieve comparable validation accuracy using 41% less compute and only 24% of the original VGG-11 model parameters; another variant of our method gives a 1 percentage point increase in accuracy over our improved VGG-11 model, giving a top-5 center-crop validation accuracy of 89.7% while reducing computation by 16% relative to the original VGG-11 model. Applying our method to the GoogLeNet architecture for ILSVRC, we achieve comparable accuracy with 26% less compute and 41% fewer model parameters. Applying our method to a near state-of-the-art network for CIFAR, we achieve comparable accuracy with 46% less compute and 55% fewer parameters.
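A minimal sketch of the kind of composite layer described, assuming PyTorch: groups of differently-shaped low-rank basis filters (1×3, 3×1, and a few full 3×3) applied in parallel, with a learned 1×1 convolution combining them into effective filters. The channel split and the simple per-group He initialization are stand-ins; the paper's actual initialization scheme for heterogeneous filter groups is not reproduced here.

```python
# Illustrative "composite" convolutional layer built from low-rank basis filters.
import torch
import torch.nn as nn

class CompositeConv(nn.Module):
    def __init__(self, c_in, c_out, c_h=32, c_v=32, c_full=16):
        super().__init__()
        self.horiz = nn.Conv2d(c_in, c_h, (1, 3), padding=(0, 1))   # 1x3 basis filters
        self.vert  = nn.Conv2d(c_in, c_v, (3, 1), padding=(1, 0))   # 3x1 basis filters
        self.full  = nn.Conv2d(c_in, c_full, 3, padding=1)          # a few full 3x3 filters
        self.combine = nn.Conv2d(c_h + c_v + c_full, c_out, 1)      # learned linear combination
        for m in (self.horiz, self.vert, self.full, self.combine):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")  # simple per-group init (assumption)
            nn.init.zeros_(m.bias)

    def forward(self, x):
        basis = torch.cat([self.horiz(x), self.vert(x), self.full(x)], dim=1)
        return self.combine(torch.relu(basis))

y = CompositeConv(64, 128)(torch.randn(1, 64, 32, 32))  # -> [1, 128, 32, 32]
```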
Astronomy and Astrophysics, 2020
Aims. Accurately and rapidly classifying exoplanet candidates from transit surveys is a goal of growing importance as the data rates from space-based survey missions increase. This is especially true for the NASA TESS mission, which generates thousands of new candidates each month. Here we created the first deep-learning model capable of classifying TESS planet candidates. Methods. We adapted an existing neural network model and then trained and tested this updated model on four sectors of high-fidelity, pixel-level TESS simulation data created using the Lilith simulator and processed using the full TESS pipeline. With the caveat that direct transfer of the model to real data will not perform as accurately, we also applied this model to four sectors of TESS candidates. Results. We find our model performs very well on our simulated data, with 97% average precision and 92% accuracy on planets in the two-class model. This accuracy is boosted by another ∼4% if planets found at the wrong periods are included. We also performed three-class and four-class classification of planets, blended and target eclipsing binaries, and non-astrophysical false positives, which have slightly lower average precision and planet accuracies but are useful for follow-up decisions. When applied to real TESS data, 61% of threshold crossing events (TCEs) coincident with currently published TESS objects of interest are recovered as planets, a further 4% are suggested to be eclipsing binaries, and we propose a further 200 TCEs as planet candidates.
arXiv (Cornell University), May 3, 2023
Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training, matching the generalization of dense models while enabling sparse training and inference. Although the resulting models are highly sparse and theoretically less computationally expensive, achieving speedups with unstructured sparsity on real-world hardware is challenging. In this work, we propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity by imposing a constant fan-in constraint. Motivated by our empirical analysis of existing DST methods at high sparsity, we additionally employ a neuron ablation method that enables SRigL to achieve state-of-the-art sparse-to-sparse structured DST performance on a variety of neural network architectures. Using a 90% sparse linear layer, we demonstrate a real-world acceleration of 3.4×/2.5× on CPU for online inference and 1.7×/13.0× on GPU for inference with a batch size of 256, compared to equivalent dense and unstructured (CSR) sparse layers, respectively.
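The constant fan-in constraint at the heart of SRigL's structured sparsity can be illustrated in a few lines of PyTorch: every output neuron keeps the same number of incoming weights (here simply the largest-magnitude ones). This shows only the masking step; the full SRigL prune-and-regrow schedule and the neuron ablation method are not reproduced.

```python
# Constant fan-in mask for a linear layer: each row keeps exactly k nonzeros.
import torch

def constant_fan_in_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """weight: [out_features, in_features]; returns a 0/1 mask with the same
    number of nonzero entries in every row (constant fan-in per neuron)."""
    out_f, in_f = weight.shape
    k = max(1, int(round(in_f * (1.0 - sparsity))))   # connections kept per neuron
    idx = weight.abs().topk(k, dim=1).indices          # top-k inputs per output neuron
    mask = torch.zeros_like(weight)
    mask.scatter_(1, idx, 1.0)
    return mask

w = torch.randn(128, 256)
m = constant_fan_in_mask(w, sparsity=0.9)              # ~26 weights kept per row
sparse_w = w * m
```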
arXiv (Cornell University), Jul 19, 2022
Estimating the Generalization Error (GE) of Deep Neural Networks (DNNs) is an important task that often relies on the availability of held-out data. The ability to better predict GE based on a single training set may yield overarching DNN design principles that reduce the reliance on trial-and-error, along with other performance assessment advantages. In search of a quantity relevant to GE, we investigate the Mutual Information (MI) between the input and final layer representations, using the infinite-width DNN limit to bound MI. An existing input compression-based GE bound is used to link MI and GE. To the best of our knowledge, this represents the first empirical study of this bound. In our attempt to empirically falsify the theoretical bound, we find that it is often tight for best-performing models. Furthermore, it detects randomization of training labels in many cases, reflects test-time perturbation robustness, and works well given only few training samples. These results are promising given that input compression is broadly applicable where MI can be estimated with confidence. (GE is also referred to as the generalization gap; note that some use "generalization error" as a synonym for "test error".)
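For context, one commonly quoted form of an input-compression generalization bound is reproduced below. It is included only to show how the mutual information I(X;T) between the input X and an internal representation T can enter a bound on GE; the exact statement, assumptions, and constants used in the paper may differ.

```latex
% A commonly quoted input-compression-style bound (illustrative form only):
% with probability at least 1 - \delta over a training set of m samples,
\[
  \mathrm{GE} \;\lesssim\; \sqrt{\frac{2^{\,I(X;T)} + \log(2/\delta)}{2m}} ,
\]
% so that smaller I(X;T) (greater input compression) tightens the bound.
```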
International Conference on Learning Representations, May 1, 2016
arXiv (Cornell University), Jun 26, 2022
The failure of deep neural networks to generalize to out-of-distribution data is a well-known problem and raises concerns about the deployment of trained networks in safety-critical domains such as healthcare, finance and autonomous vehicles. We study a particular kind of distribution shift: shortcuts, or spurious correlations, in the training data. Shortcut learning is often only exposed when models are evaluated on real-world data that does not contain the same spurious correlations, posing a serious dilemma for AI practitioners trying to properly assess the effectiveness of a trained model for real-world applications. In this work, we propose to use the mutual information (MI) between the learned representation and the input as a metric to find where in training the network latches onto shortcuts. Experiments demonstrate that MI can be used as a domain-agnostic metric for monitoring shortcut learning.
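A very rough sketch of one way such monitoring could be set up, assuming the common binning approach to MI estimation: quantize the learned representation on a fixed probe batch and track the entropy of the binned codes over training (for a deterministic network, the binned estimate of I(X;T) reduces to the entropy of the binned representation T). The binning resolution and this particular estimator are assumptions, not necessarily what the paper uses.

```python
# Binned proxy for I(X;T), evaluated on a fixed probe batch once per epoch.
import numpy as np

def binned_mi_proxy(reps: np.ndarray, n_bins: int = 30) -> float:
    """reps: [n_samples, dim] activations of a chosen layer on a probe batch."""
    lo, hi = reps.min(), reps.max()
    digitized = np.digitize(reps, np.linspace(lo, hi, n_bins))
    # Each distinct pattern of binned activations is one symbol of T.
    _, counts = np.unique(digitized, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())   # H(T_binned) in bits

# Call once per epoch and inspect how the curve evolves; the paper uses MI
# measured over training to flag where shortcut learning sets in.
```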
Astronomical Data Analysis Software and Systems XXVII, Oct 1, 2019
Various embodiments of the invention relate to automated emergency detection and response systems and methods. In some embodiments, respective data sets generated in relation to distinctly located sensors are compared, and an appropriate response protocol is selected on the basis of this comparison and as a function of at least one of the data sets.
International Conference on Learning Representations, May 27, 2016
We train CNNs with composite layers of oriented low-rank filters, of which the network learns the most effective linear combination. In effect, our networks learn a basis space for filters, based on simpler low-rank filters. We propose an initialization for composite layers of heterogeneous filters, allowing such networks to be trained from scratch. Our models are faster and use fewer parameters; with a small number of full filters, our models also generalize better.
arXiv (Cornell University), Nov 20, 2015
ArXiv, 2021
Recent advancements in self-supervised learning have reduced the gap between supervised and unsupervised representation learning. However, most self-supervised and deep clustering techniques rely heavily on data augmentation, rendering them ineffective for many learning tasks where insufficient domain knowledge exists for performing augmentation. We propose a new self-distillation-based algorithm for domain-agnostic clustering. Our method builds upon existing deep clustering frameworks and requires no separate student model. The proposed method outperforms existing domain-agnostic (augmentation-free) algorithms on CIFAR-10. We empirically demonstrate that knowledge distillation can improve unsupervised representation learning by extracting richer 'dark knowledge' from the model than using predicted labels alone. Preliminary experiments also suggest that self-distillation improves the convergence of DeepCluster-v2.
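To make the 'dark knowledge' point concrete, below is a generic temperature-scaled distillation loss in PyTorch: soft teacher probabilities carry more information than hard predicted labels. This is the standard formulation, not the paper's exact clustering objective; the temperature and reduction are arbitrary choices.

```python
# Generic soft-target distillation loss (illustrative, not the paper's objective).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```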
Despite having high accuracy, neural nets have been shown to be susceptible to adversarial examples, where a small perturbation to an input can cause it to become mislabeled. We propose metrics for measuring the robustness of a neural net and devise a novel algorithm for approximating these metrics based on an encoding of robustness as a linear program. We show how our metrics can be used to evaluate the robustness of deep neural nets in experiments on the MNIST and CIFAR-10 datasets. Our algorithm generates more informative estimates of robustness metrics compared to estimates based on existing algorithms. Furthermore, we show how existing approaches to improving robustness "overfit" to adversarial examples generated using a specific algorithm. Finally, we show that our techniques can be used to improve neural net robustness both according to the metrics that we propose and according to previously proposed metrics.
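To illustrate what "encoding robustness as a linear program" means in the simplest possible setting, the sketch below computes, for a purely linear classifier, the smallest L∞ perturbation that makes some other class score at least as high as the true class, via SciPy's LP solver. The paper's algorithm handles ReLU networks by restricting to a fixed activation region; that machinery is not reproduced here, and this toy version is only an analogy.

```python
# Toy LP: minimal L_inf perturbation that flips a *linear* classifier f(x) = Wx + b.
import numpy as np
from scipy.optimize import linprog

def min_linf_flip(W, b, x, true_class):
    n = x.size
    best = np.inf
    for j in range(W.shape[0]):
        if j == true_class:
            continue
        # Variables: [r (n), eps (1)]; minimize eps subject to |r_k| <= eps and
        # class j scoring at least as high as the true class at x + r.
        c = np.concatenate([np.zeros(n), [1.0]])
        dw = W[true_class] - W[j]
        db = b[true_class] - b[j]
        A_ub = [np.concatenate([dw, [0.0]])]      # dw @ r <= -(dw @ x + db)
        b_ub = [-(dw @ x + db)]
        for k in range(n):                         # |r_k| <= eps
            e = np.zeros(n + 1); e[k] = 1.0; e[-1] = -1.0
            A_ub.append(e.copy()); b_ub.append(0.0)
            e[k] = -1.0
            A_ub.append(e.copy()); b_ub.append(0.0)
        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      bounds=[(None, None)] * n + [(0, None)])
        if res.success:
            best = min(best, res.x[-1])
    return best  # smallest L_inf radius at which the prediction can change

# Hypothetical 3-class, 2-feature example:
W = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, 1.0]])
b = np.zeros(3)
print(min_linf_flip(W, b, np.array([1.0, 1.0]), true_class=0))
```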
Deep learning has in recent years come to dominate the previously separate fields of research in machine learning, computer vision, natural language understanding and speech recognition. Despite breakthroughs in training deep networks, there remains a lack of understanding of both the optimization and structure of deep networks. The approach advocated by many researchers in the field has been to train monolithic networks with excess complexity and strong regularization, an approach that leaves much to be desired in efficiency. Instead, we propose that carefully designing networks in consideration of our prior knowledge of the task and learned representation can improve the memory and compute efficiency of state-of-the-art networks, and even improve generalization, which we propose to denote as structural priors. We present two such novel structural priors for convolutional neural networks, and evaluate them in state-of-the-art image classification CNN architectures. The first of these ...
ArXiv, 2016
This paper investigates the connections between two state-of-the-art classifiers: decision forests (DFs, including decision jungles) and convolutional neural networks (CNNs). Decision forests are computationally efficient thanks to their conditional computation property (computation is confined to only a small region of the tree, the nodes along a single branch). CNNs achieve state-of-the-art accuracy thanks to their representation learning capabilities. We present a systematic analysis of how to fuse conditional computation with representation learning and achieve a continuum of hybrid models with different ratios of accuracy vs. efficiency. We call this new family of hybrid models conditional networks. Conditional networks can be thought of as: i) decision trees augmented with data transformation operators, or ii) CNNs with block-diagonal sparse weight matrices and explicit data routing functions. Experimental validation is performed on the common task of image classification o...
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
We propose a new method for creating computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. This allows a significant reduction in computational cost and number of parameters compared to state-of-the-art deep CNNs, without compromising accuracy, by exploiting the sparsity of inter-layer filter dependencies. We validate our approach by using it to train more efficient variants of state-of-the-art CNN architectures, evaluated on the CIFAR10 and ILSVRC datasets. Our results show similar or higher accuracy than the baseline architectures with much less computation, as measured by CPU and GPU timings. For example, for ResNet-50, our model has 40% fewer parameters, 45% fewer floating point operations, and is 31% (12%) faster on a CPU (GPU). For the deeper ResNet-200, our model has 48% fewer parameters and 27% fewer floating point operations, while maintaining state-of-the-art accuracy. For GoogLeNet, our model has 7% fewer parameters and is 21% (16%) faster on a CPU (GPU).
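A minimal PyTorch sketch of the kind of "root-like" sparse connection structure described: a grouped 3×3 convolution (each filter connects to only a subset of input channels, i.e. a block-diagonal filter-channel dependency) followed by a 1×1 convolution that mixes information across groups. The channel counts and group width are illustrative assumptions, not a configuration from the paper.

```python
# Illustrative "root" module: grouped 3x3 conv + 1x1 conv for inter-group mixing.
import torch
import torch.nn as nn

def root_module(c_in: int, c_mid: int, c_out: int, groups: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, kernel_size=3, padding=1, groups=groups),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_out, kernel_size=1),   # mixes information across groups
    )

block = root_module(c_in=64, c_mid=64, c_out=128, groups=8)
y = block(torch.randn(1, 64, 56, 56))             # -> [1, 128, 56, 56]
# With groups=8, the 3x3 layer has 8x fewer parameters and FLOPs than a dense 3x3.
```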
We train CNNs with composite layers of oriented low-rank filters, of which the network learns the most effective linear combination. In effect, our networks learn a basis space for filters, based on simpler low-rank filters. We propose an initialization for composite layers of heterogeneous filters, allowing such networks to be trained from scratch. Our models are faster and use fewer parameters; with a small number of full filters, our models also generalize better. Previous work on separable (factorized) convolution explicitly approximates a low-rank factorization of a trained CNN's full-rank filters using sequential convolutional layers with filters of differing orientation [3, 2], reducing the cost from O(d × h × w × c) to O(d × h × m + m × w × c) for each effective filter. However, in most CNNs d ≥ m and m is comparable to c, so this alone is not much faster. Moreover, all previous methods approximated a pre-trained model; with our initialization, we can train these networks from scratch. A separable VGG-11 variant with global max-pooling (GMP) achieves 88% top-5 accuracy on ILSVRC.
PhD Thesis - Cambridge, 2018
A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science (Computing) at Queen's University, Kingston, Ontario, Canada.