Knots in random neural networks

The Upper Bound on Knots in Neural Networks

arXiv, 2016

Neural networks with rectified linear unit activations are essentially multivariate linear splines. As such, one of many ways to measure the "complexity" or "expressivity" of a neural network is to count the number of knots in the spline model. We study the number of knots in fully-connected feedforward neural networks with rectified linear unit activation functions. We intentionally keep the neural networks very simple, so as to make theoretical analyses more approachable. An induction on the number of layers $l$ yields an upper bound on the number of knots in $\mathbb{R} \to \mathbb{R}^p$ deep neural networks. With $n_i \gg 1$ neurons in layer $i = 1, \dots, l$, the upper bound is approximately $n_1 \dots n_l$. We then show that the exact upper bound is tight, and we demonstrate it with an example. The purpose of these analyses is to pave a path for understanding the behavior of general $\mathbb{R}^q \to \mathbb{R}^p$ neural networks.
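As a rough illustration of the knot-counting setup (a sketch, not the paper's construction or proof), one can sample a random $\mathbb{R} \to \mathbb{R}$ ReLU network and count slope changes on a fine grid; the widths, weight scales, and detection threshold below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relu_net(widths):
    """Gaussian weights and biases for layer widths [1, n_1, ..., n_l, 1]."""
    return [(rng.normal(0, 1 / np.sqrt(n_in), size=(n_out, n_in)),
             rng.normal(0, 1, size=n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, x):
    h = x
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b[:, None], 0.0)      # ReLU hidden layers
    W, b = params[-1]
    return (W @ h + b[:, None]).ravel()              # affine output layer

n1 = n2 = 8
params = random_relu_net([1, n1, n2, 1])
x = np.linspace(-10, 10, 200_001)
y = forward(params, x[None, :])

slopes = np.diff(y) / np.diff(x)
knots = int(np.sum(np.abs(np.diff(slopes)) > 1e-6))  # grid cells where the slope changes
print(f"observed knots: {knots}, approximate upper bound n_1*n_2 = {n1 * n2}")
```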

The empirical size of trained neural networks

arXiv, 2016

ReLU neural networks define piecewise linear functions of their inputs. However, initializing and training a neural network is very different from fitting a linear spline. In this paper, we expand empirically upon previous theoretical work to demonstrate features of trained neural networks. Standard network initialization and training produce networks vastly simpler than a naive parameter count would suggest, and they can impart odd features to the trained network. However, we also show that this forced simplicity is beneficial and, indeed, critical for the wide success of these networks.
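A hedged sketch of the kind of empirical measurement this suggests (not the paper's actual experiments): train a small 1D ReLU network with PyTorch on a toy target and count how many knots the fitted function actually uses, versus the scale its width would allow. The architecture, target, and tolerance are placeholders:

```python
import torch

torch.manual_seed(0)
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = torch.sin(3 * x)                                   # toy 1D regression target

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    torch.nn.functional.mse_loss(net(x), y).backward()
    opt.step()

# Count slope changes of the fitted function on a dense grid (double precision
# so that floating-point noise does not register as spurious knots).
net = net.double()
grid = torch.linspace(-1, 1, 50_000, dtype=torch.float64).unsqueeze(1)
with torch.no_grad():
    out = net(grid).squeeze(1)
slopes = torch.diff(out) / torch.diff(grid.squeeze(1))
knots = int((torch.diff(slopes).abs() > 1e-6).sum())
print(f"knots used by the trained network: {knots} (width scale ~ 32*32 = {32*32})")
```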

Spline representation and redundancies of one-dimensional ReLU neural network models

Analysis and Applications

We analyze the structure of a one-dimensional deep ReLU neural network (ReLU DNN) in comparison to the model of continuous piecewise linear (CPL) spline functions with arbitrary knots. In particular, we give a recursive algorithm to transfer the parameter set determining the ReLU DNN into the parameter set of a CPL spline function. Using this representation, we show that, after removing the well-known parameter redundancies of the ReLU DNN caused by the positive scaling property, all remaining parameters are independent. Moreover, we show that a ReLU DNN with one, two, or three hidden layers can represent CPL spline functions with K arbitrarily prescribed knots (breakpoints), where K is the number of real parameters determining the normalized ReLU DNN (up to the output layer parameters). Our findings are useful for fixing a priori conditions on the ReLU DNN to achieve an output with prescribed breakpoints and function values.
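For the depth-one base case, the conversion from network parameters to spline knots can be written down directly; the sketch below (an illustration, not the paper's general recursive algorithm) reads off the knots of $f(x) = c^\top \mathrm{ReLU}(wx + b) + d$:

```python
import numpy as np

def relu_to_spline(w, b, c, d):
    """Knots of the one-hidden-layer net f(x) = c @ ReLU(w*x + b) + d.

    Neuron i switches on/off at x = -b_i / w_i (when w_i != 0), so each such
    neuron contributes one candidate knot of the CPL spline.
    """
    active = w != 0
    knots = np.sort(-b[active] / w[active])
    f = lambda t: c @ np.maximum(w * t + b, 0.0) + d
    return knots, np.array([f(t) for t in knots])

rng = np.random.default_rng(1)
w, b, c, d = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5), rng.normal()
knots, values = relu_to_spline(w, b, c, d)
print("knots:", np.round(knots, 3))
print("values at knots:", np.round(values, 3))
```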

On the Effect of the Activation Function on the Distribution of Hidden Nodes in a Deep Network

Neural Computation

We analyze the joint probability distribution of the lengths of the vectors of hidden variables in different layers of a fully connected deep network when the weights and biases are chosen randomly according to Gaussian distributions. We show that if the activation function satisfies a minimal set of assumptions, satisfied by all activation functions that we know of that are used in practice, then, as the width of the network gets large, the “length process” converges in probability to a length map that is determined as a simple function of the variances of the random weights and biases and of the activation function. We also show that this convergence may fail for activation functions that violate our assumptions. We show how to use this analysis to choose the variance of weight initialization, depending on the activation function, so that hidden variables maintain a consistent scale throughout the network.
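A quick numerical check of the length-map idea (an illustrative sketch; the variance choice $\sigma_w^2 = 2$ for ReLU and the widths are assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_lengths(x, widths, sigma_w=np.sqrt(2.0), sigma_b=0.0):
    """||h_k||^2 / n_k across ReLU layers with Gaussian weights and biases."""
    h, lengths = x, []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0, sigma_b, size=n_out)
        h = np.maximum(W @ h + b, 0.0)
        lengths.append(h @ h / n_out)
    return lengths

x = rng.normal(size=500)                 # fixed input with unit-variance entries
for width in (50, 500, 5000):
    lengths = normalized_lengths(x, [x.size] + [width] * 6)
    print(width, np.round(lengths, 3))   # fluctuations shrink as the width grows
```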

What training reveals about neural network complexity

arXiv, 2021

This work explores the hypothesis that the complexity of the function a deep neural network (NN) is learning can be deduced from how fast its weights change during training. Our analysis provides evidence for this supposition by relating the network’s distribution of Lipschitz constants (i.e., the norm of the gradient at different regions of the input space) during different training intervals with the behavior of the stochastic training procedure. We first observe that the average Lipschitz constant close to the training data affects various aspects of the parameter trajectory, with more complex networks having a longer trajectory, bigger variance, and often veering further from their initialization. We then show that NNs whose biases are trained more steadily have bounded complexity even in regions of the input space that are far from any training point. Finally, we find that steady training with Dropout implies a training- and data-dependent generalization bound that grows poly-logarithmically...
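One ingredient of this analysis, the distribution of local Lipschitz constants near the data, can be estimated from input gradients; the sketch below uses an untrained placeholder model and an assumed perturbation scale, purely to show the measurement:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(                     # untrained placeholder model
    torch.nn.Linear(10, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
x_train = torch.randn(128, 10)                   # stand-in for training inputs

def local_lipschitz(model, x, noise=0.05, n_samples=8):
    """Average input-gradient norm at points perturbed around x."""
    norms = []
    for _ in range(n_samples):
        z = (x + noise * torch.randn_like(x)).requires_grad_(True)
        (grad,) = torch.autograd.grad(model(z).sum(), z)
        norms.append(grad.norm(dim=1))           # one gradient norm per point
    return torch.stack(norms).mean()

print("average local Lipschitz estimate:", float(local_lipschitz(model, x_train)))
```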

Improving Deep Neural Network Random Initialization Through Neuronal Rewiring

arXiv, 2022

The deep learning literature is continuously updated with new architectures and training techniques. However, weight initialization is overlooked by most recent research, despite some intriguing findings regarding random weights. On the other hand, recent works have been approaching Network Science to understand the structure and dynamics of Artificial Neural Networks (ANNs) after training. Therefore, in this work, we analyze the centrality of neurons in randomly initialized networks. We show that a higher neuronal strength variance may decrease performance, while a lower neuronal strength variance usually improves it. A new method is then proposed to rewire neuronal connections according to a preferential attachment (PA) rule based on their strength, which significantly reduces the strength variance of layers initialized by common methods. In this sense, PA rewiring only reorganizes connections, while preserving the magnitude and distribution of the weights. We show through an extensive statistical analysis in image classification that performance is improved in most cases, both during training and testing, when using both simple and complex architectures and learning schedules. Our results show that, aside from the magnitude, the organization of the weights is also relevant for better initialization of deep ANNs.

Keywords: artificial neural networks, deep learning, network science, complex networks, computer vision, weight initialization

Recent works [1, 2] discuss the impacts of stochastic and random hyperparameters (different random seeds) used during the construction and training of deep ANNs, and also the expected uncertainty in this process [3]. One important finding is that although the performance variance caused by different seeds can be relatively small, outliers are easily found, i.e., models with performance far above or below the average. However, the element that poses the highest degree of freedom in this context (at least in theory), the distribution of initial random weights, is overlooked by most current research. Some intriguing properties have also been observed, for instance, specific subsets of random weights that make training of sparse ANNs particularly effective [4]. More surprisingly, these random weights may not even require additional training [5, 6]. It has also been shown that successfully trained ANNs usually converge to a neighborhood of weights close to their initial configuration [7, 8]. These works corroborate the importance of initial weights and also point toward the existence of particular random structures related to better initial models. Nevertheless, most works initialize ANNs with simple methods that only define bounds for random weight sampling. Moreover, the effects caused by the randomness of these methods are most of the time ignored, i.e., researchers arbitrarily choose to consider either a single...
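The central quantity here is the per-neuron strength. The sketch below computes its variance for a Gaussian-initialized layer and then equalizes it with a naive magnitude-dealing reshuffle; this stands in for (and is explicitly not) the paper's preferential-attachment rewiring, only to show that connections can be reorganized while the weight values themselves are preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 256, 256
W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))   # He-style init

strength = np.abs(W).sum(axis=1)                 # incoming strength of each neuron
print("strength variance before:", strength.var())

# Reshuffle: deal the weights, sorted by magnitude, across neurons in a snake
# order so that per-neuron strengths even out. The multiset of weight values
# (hence their magnitude and distribution) is untouched.
flat = W.ravel()
flat = flat[np.argsort(-np.abs(flat))]
idx = np.arange(flat.size)
rows = idx % n_out
odd_pass = (idx // n_out) % 2 == 1
rows[odd_pass] = n_out - 1 - rows[odd_pass]
W_new = np.empty_like(W)
fill = np.zeros(n_out, dtype=int)
for value, r in zip(flat, rows):
    W_new[r, fill[r]] = value
    fill[r] += 1
print("strength variance after: ", np.abs(W_new).sum(axis=1).var())
```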

Insights into randomized algorithms for neural networks: Practical issues and common pitfalls

Information Sciences, 2017

Random Vector Functional-link (RVFL) networks, a class of learner models, can be regarded as feed-forward neural networks built with a specific randomized algorithm: the input weights and biases are randomly assigned and fixed during the training phase, and the output weights are evaluated analytically by the least-squares method. In this paper, we provide some insights into RVFL networks and highlight some practical issues and common pitfalls associated with RVFL-based modelling techniques. Inspired by the folklore that "all high-dimensional random vectors are almost always nearly orthogonal to each other", we establish a theoretical result on the infeasibility of RVFL networks for universal approximation if an RVFL network is built incrementally with random selection of the input weights and biases from a fixed scope and constructive evaluation of its output weights. This work also addresses the significance of the scope setting of random weights and biases with respect to modelling performance. Two numerical examples are employed to illustrate our findings, which theoretically and empirically reveal some facts and limits of this class of randomized learning algorithms.
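A minimal RVFL-style sketch, with the random-weight scope exposed as a parameter since the abstract highlights its importance (the activation, scope values, and toy data are assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def rvfl_fit_predict(X_train, y_train, X_test, n_hidden=200, scope=1.0):
    d = X_train.shape[1]
    W = rng.uniform(-scope, scope, size=(d, n_hidden))       # fixed random input weights
    b = rng.uniform(-scope, scope, size=n_hidden)             # fixed random biases
    def features(X):
        H = np.tanh(X @ W + b)                                # random hidden features
        return np.hstack([X, H, np.ones((X.shape[0], 1))])    # direct links + bias column
    beta, *_ = np.linalg.lstsq(features(X_train), y_train, rcond=None)
    return features(X_test) @ beta                            # least-squares output weights

# Toy 1D regression: effect of the scope on test error.
X = rng.uniform(-1, 1, size=(400, 1))
y = np.sin(4 * X[:, 0]) + 0.05 * rng.normal(size=400)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]
for scope in (0.1, 1.0, 10.0):
    err = np.mean((rvfl_fit_predict(X_tr, y_tr, X_te, scope=scope) - y_te) ** 2)
    print(f"scope={scope:<5} test MSE={err:.4f}")
```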

Simplicity bias in the parameter-function map of deep neural networks

2019

The idea that neural networks may exhibit a bias towards simplicity has a long history (1; 2; 3; 4). Simplicity bias (5) provides a way to quantify this intuition. It predicts, for a broad class of input-output maps which can describe many systems in science and engineering, that simple outputs are exponentially more likely to occur upon uniform random sampling of inputs than complex outputs are. This simplicity bias behaviour has been observed for systems ranging from the RNA sequence-to-secondary-structure map, to systems of coupled differential equations, to models of plant growth. Deep neural networks can be viewed as a mapping from the space of parameters (the weights) to the space of functions (how inputs get transformed to outputs by the network). We show that this parameter-function map obeys the necessary conditions for simplicity bias, and we show numerically that it is hugely biased towards functions with low descriptional complexity. We also demonstrate a Zipf-like power-law...
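A toy version of the random-sampling experiment (much smaller than the paper's setting; the input size, architecture, and thresholding rule are assumptions): sample network parameters at random, record the Boolean function realized on all inputs, and inspect how unevenly the probability mass falls across functions:

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_hidden, n_samples = 5, 16, 20_000
inputs = np.array([[(i >> k) & 1 for k in range(n_bits)]
                   for i in range(2 ** n_bits)], dtype=float)   # all 5-bit inputs

counts = Counter()
for _ in range(n_samples):
    W1 = rng.normal(size=(n_bits, n_hidden))
    b1 = rng.normal(size=n_hidden)
    w2 = rng.normal(size=n_hidden)
    out = np.maximum(inputs @ W1 + b1, 0.0) @ w2                # random ReLU net
    counts["".join(str(int(v > 0)) for v in out)] += 1          # function as a bit string

freqs = sorted(counts.values(), reverse=True)
print("distinct functions sampled:", len(counts))
print("top-10 frequencies:", freqs[:10])    # a handful of simple functions dominate
```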

Efficient Design of Neural Networks with Random Weights

arXiv, 2020

Single-layer feedforward networks with random weights are known for their non-iterative and fast training algorithms and are successful in a variety of classification and regression problems. A major drawback of these networks is that they require a large number of hidden units. In this paper, we propose a technique to reduce the number of hidden units substantially without affecting the accuracy of the networks significantly. We introduce the concept of primary and secondary hidden units. The weights for the primary hidden units are chosen randomly, while the secondary hidden units are derived using pairwise combinations of the primary hidden units. Using this technique, we show that the number of hidden units can be reduced by at least one order of magnitude. We experimentally show that this technique leads to a significant drop in computations at inference time and has only a minor impact on network accuracy. A huge reduction in computations is possible if slightly lower accuracy is acceptable.
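The abstract does not spell out the pairwise combination rule, so the sketch below assumes one plausible reading: secondary pre-activations are sums of pairs of primary pre-activations, which makes each secondary unit cheap to compute once the primary units are evaluated. The paper's actual construction may differ.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)

def hidden_features(X, n_primary=20):
    """Primary units use random weights; secondary pre-activations are pairwise
    sums of primary pre-activations (an assumed combination rule), so each
    secondary unit costs O(1) extra instead of O(d)."""
    d = X.shape[1]
    W = rng.normal(size=(d, n_primary))                    # random primary weights
    Z = X @ W                                              # primary pre-activations
    pairs = list(combinations(range(n_primary), 2))
    Z_sec = np.stack([Z[:, i] + Z[:, j] for i, j in pairs], axis=1)
    return np.tanh(np.hstack([Z, Z_sec]))                  # 20 primary + 190 secondary units

X = rng.normal(size=(100, 8))
print(hidden_features(X).shape)                            # (100, 210)
```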

Adaptive rewiring of random neural networks generates convergent-divergent units

2021

Brain networks are adaptively rewired continually, adjusting their topology to bring about functionality and efficiency in sensory, motor and cognitive tasks. In model neural network architectures, adaptive rewiring generates complex, brain-like topologies. Present models, however, cannot account for the emergence of complex directed connectivity structures. We tested a biologically plausible model of adaptive rewiring in directed networks, based on two algorithms widely used in distributed computing: advection and consensus. When both are used in combination as rewiring criteria, adaptive rewiring shortens path length and enhances connectivity. When keeping a balance between advection and consensus, adaptive rewiring produces convergent-divergent units consisting of convergent hub nodes, which collect inputs from pools of sparsely connected, or local, nodes and project them via densely interconnected processing nodes onto divergent hubs that broadcast output back to the local pools...
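A generic toy of dynamics-driven rewiring on a random directed graph; the criterion used here (connect the non-adjacent pair whose consensus states are most similar) is a placeholder and not the paper's advection/consensus rules, and the graph size, density, and step counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, steps = 60, 0.05, 500
A = (rng.random((n, n)) < p).astype(float)     # random directed adjacency matrix
np.fill_diagonal(A, 0.0)

def consensus_states(A, x, eps=0.05, iters=20):
    """Brief discrete-time consensus dynamics: x <- x + eps * (A x - deg * x)."""
    deg = A.sum(axis=1, keepdims=True)
    for _ in range(iters):
        x = x + eps * (A @ x - deg * x)
    return x

for _ in range(steps):
    x = consensus_states(A, rng.normal(size=(n, 1)))
    edges = np.argwhere(A > 0)
    i, j = edges[rng.integers(len(edges))]     # remove one random existing edge
    A[i, j] = 0.0
    diff = np.abs(x - x.T)                     # pairwise distance of node states
    diff[(A > 0) | np.eye(n, dtype=bool)] = np.inf
    k, l = np.unravel_index(np.argmin(diff), diff.shape)
    A[k, l] = 1.0                              # add the edge between the closest pair

print("final mean out-degree:", A.sum(axis=1).mean())
```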