A Method on Searching Better Activation Functions

Haoyuan Sun∗,1, Zihao Wu∗,2, Bo Xia1, Pu Chang3, Zibin Dong2, Yifu Yuan2,
Yongzhe Chang†,1, Xueqian Wang†,1
∗ equal contribution, † corresponding authors
1 Tsinghua University, 2 Tianjin University, 3 Anhui Polytechnic University

Abstract

The success of artificial neural networks (ANNs) hinges greatly on the judicious selection of an activation function, which introduces non-linearity into the network and enables it to model sophisticated relationships in data. However, the search for activation functions has so far relied largely on empirical knowledge and lacked theoretical guidance, which has hindered the identification of more effective activation functions. In this work, we offer a solution to this issue. First, we theoretically demonstrate the existence of the worst activation function with boundary conditions (WAFBC) from the perspective of information entropy. Furthermore, inspired by the Taylor expansion form of the information entropy functional, we propose the Entropy-based Activation Function Optimization (EAFO) methodology. The EAFO methodology presents a novel perspective for designing static activation functions in deep neural networks and shows potential for dynamically optimizing activations during iterative training. Utilizing the EAFO methodology, we derive a novel activation function from ReLU, named Correction Regularized ReLU (CRReLU). Experiments conducted with the vision transformer and its variants on the CIFAR-10, CIFAR-100 and ImageNet-1K datasets demonstrate the superiority of CRReLU over existing corrections of ReLU. In extensive empirical studies on the task of large language model (LLM) fine-tuning, CRReLU also exhibits superior performance compared to GELU, suggesting its broader potential for practical applications.

1 Introduction

The flourishing development of artificial intelligence is predominantly attributable to the rapid advancement of artificial neural networks (ANNs) in recent years. Activation functions (AFs) are critical to the performance of ANNs because they enable nonlinear representations. Despite the continuous development of novel activation functions and their empirical success in improving network performance, theoretical analysis of these activation functions remains scarce in the research literature. In other words, improved activation functions are often proposed on the basis of empirical evidence without theoretical validation, which greatly hinders the search for better activation functions. Hence, a theoretically reliable methodology for searching for better activation functions holds significant value for the machine learning community and future research.

In our work, we begin our exploration from the correlation between information entropy and the Bayesian error rate. We then establish the connection between the activation function and information entropy, and ultimately show that a worst activation function exists under boundary conditions and derive its specific form. Based on this derivation, we propose a novel method for optimizing activation functions, namely the Entropy-based Activation Function Optimization (EAFO) methodology. Utilizing the EAFO methodology, we derive a novel activation function, Correction Regularized ReLU (CRReLU), starting from the conventional ReLU [1, 2, 3]. The derived CRReLU possesses several desirable properties, including the avoidance of neuron death and the preservation of neuron sparsity. Experiments involving the vision transformer [4] and its variants [5, 6], conducted on the CIFAR-10, CIFAR-100 [7] and ImageNet-1K [8] datasets, consistently demonstrate the superior performance of CRReLU compared to other activation function baselines. Extensive experimental studies on the task of large language model (LLM) fine-tuning with the direct preference optimization (DPO) method [9] also demonstrate that CRReLU surpasses GELU in performance, suggesting the wider applicability of CRReLU in practical scenarios. Moreover, the EAFO methodology shows potential to further optimize activation functions during the iterative training of ANNs, although the specific optimization techniques remain a topic of ongoing research.

In summary, our main contributions are as follows:

2 Related Work

With the development of deep learning, deep neural networks (DNNs) have gained significant prominence and achieved notable success across various domains. Recent advancements in the field of natural language processing, exemplified by large language models such as GPT-4 [10], Llama-3 [11], and Gemini [12], have propelled machine understanding and generation of natural language to unprecedented levels of accuracy. Deep neural networks have also found important applications in computer vision [4, 5, 6], deep reinforcement learning [13], autonomous driving [14], and many other areas.

The nonlinearity of activation functions in neural networks is crucial both for enabling the efficient learning of complex patterns and for facilitating the extraction of intricate, hierarchical representations from input data, thus allowing networks to capture more complex relationships between input and output variables. However, the nonlinear activation functions of deep neural networks also present challenges during training, such as vanishing gradients [15] and exploding gradients [16].

To address these challenges, researchers have explored alternative approaches for improvement, including the enhancement of activation functions. In the nascent stages of activation function development, scholars predominantly focused on rudimentary thresholding functions, initially directing their attention towards squashing functions such as the Sigmoid function and the Tanh function [17]. To mitigate the issues of vanishing and exploding gradients, various non-squashing functions have been proposed. Notably, ReLU [1, 2, 3] has played a pivotal role in the remarkable success of deep learning. The derivative of ReLU for positive inputs is one, which prevents the gradient from vanishing; however, negative values are mapped to zero, leading to two main issues: (1) the absence of information flow for negative values, known as dying ReLU; (2) a shift in subsequent layers due to the positive bias maintained by the activation.

Given the aforementioned challenges, researchers have dedicated significant efforts to improving the effectiveness of activation functions. The Leaky ReLU [18] activation function permits a small negative slope, ensuring that some gradient can still be propagated even when the input is less than zero. The Parametric ReLU (PReLU) [19] is an extension of the Leaky ReLU, where the negative slope $\alpha$ is a learnable parameter that is learned from data rather than being predetermined. The Exponential Linear Unit (ELU) [20] outputs a negative value when $x$ is less than 0, leading to the advantageous property that the average output approaches 0. The Continuously Differentiable Exponential Linear Unit (CELU) [21] proposes an alternative parameterization that simplifies analysis of the rectifier function and facilitates the tuning of parameters in ELU. Swish (also known as SiLU) [22] has been shown to enhance training stability and performance in deep learning models due to its smooth nature and improved gradient propagation. In the Mish [23] activation function, the unboundedness of positive values avoids the saturation caused by a plateau, the slight allowance for negative values enables better gradient flow, and the smoother activation function allows information to flow better into deep neural networks, resulting in better accuracy and generalization.

3 Motivation

As Section 2 shows, researchers have dedicated substantial effort to exploring improved activation functions, which are widely acknowledged to hold considerable significance for the advancement of deep learning. However, proposals for these activation functions generally lack a theoretical framework, which makes such searches, to some extent, inefficient and aimless.

GELU (Gaussian Error Linear Unit) [24] was first proposed in 2016 and has since achieved significant success in a variety of fields, especially with the emergence of large language models in recent years. It has been incorporated into several cutting-edge neural network architectures, such as BERT [25], ViT [4], and GPT-4 [10], demonstrating its versatility and effectiveness. Only in the work of Lee [26] (2023) were insightful mathematical properties of GELU finally unveiled, including its differentiability, boundedness, stationarity, and smoothness. It is thus often the case that the superior performance of a novel activation function lacks a mathematical explanation. Understanding may be limited to the mere fact that the function performs better, which hampers both the exploration of better activation functions and the interpretability of neural networks.

In light of the aforementioned challenges, our work endeavors to propose a methodology for searching better activation functions, not only enabling the discovery of improved activation functions but also elucidating the reasons behind their superior performance at the same time.

4 Methodology

4.1 Problem Setup

4.1.1 Bayesian Error Rate and Information Entropy

A deep neural network can be simplified as a feature extraction layer followed by a fully connected layer for final classification. From a probabilistic perspective, in binary classification, the feature extraction layer can be viewed as transforming the shape of a mixture distribution so that the final fully connected layer can separate the two distributions with a hyperplane. Hence, the more the two distributions overlap, the higher the Bayesian error rate and the worse the classification performance. Furthermore, a lower information entropy corresponds to a higher likelihood of forming two distinct peaks (i.e. the smaller the classification uncertainty, the easier it is to classify), whereas an increase in the overlap between the two distributions increases the information entropy (i.e. the greater the classification uncertainty, the harder it is to classify). These statements extend to multi-class classification; further elaboration is omitted here.
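The link between overlap, Bayesian error rate, and classification uncertainty can be illustrated with a small numerical sketch. It assumes a toy setting that is not taken from the paper: two equally likely classes with unit-variance Gaussian features centred at $-\mu$ and $+\mu$, and it measures classification uncertainty by the conditional class entropy $H(C\mid X)$, which is only one possible choice. All function names are hypothetical.

```python
import math


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error(mu):
    """Bayes error for equal-prior classes N(-mu, 1) and N(+mu, 1).

    The optimal decision boundary is x = 0, so each class is
    misclassified with probability Phi(-mu).
    """
    return normal_cdf(-mu)


def class_entropy(mu, lo=-12.0, hi=12.0, n=4800):
    """Conditional class entropy H(C | X) in nats, via the midpoint rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        joint = (0.5 * normal_pdf(x, -mu), 0.5 * normal_pdf(x, mu))
        mixture = joint[0] + joint[1]
        for w in joint:
            if w > 0.0:
                # -p(x, c) * log p(c | x), accumulated over both classes
                total -= w * math.log(w / mixture) * h
    return total


# Larger mu means less overlap between the two class distributions.
separations = [0.25, 0.5, 1.0, 2.0, 3.0]
errors = [bayes_error(mu) for mu in separations]
entropies = [class_entropy(mu) for mu in separations]

for mu, err, ent in zip(separations, errors, entropies):
    print(f"mu={mu:4.2f}  bayes_error={err:.4f}  H(C|X)={ent:.4f}")
```

In this toy setting both quantities shrink together as the two peaks move apart, which is the qualitative behaviour the paragraph above describes.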

4.1.2 Activation Function and Information Entropy

Let $y(x)$ denote the inverse function of the activation function, and assume that the activation function is monotonically increasing. Many earlier activation functions, such as Sigmoid and Tanh [17], satisfy the assumption that the function has an inverse over its entire domain. When an activation function fails to meet the assumption, we can instead work with the part of the function that does satisfy it, as is the case with the positive part of ReLU.

Suppose the data before passing through the activation function follow the distribution $p(x)$. The data distribution after passing through the activation function is then $q(x)=p\left(y(x)\right)y^{\prime}(x)$, where $y^{\prime}(x)$ is the derivative of $y(x)$. Hence, we can express the information entropy as:

$$\mathbb{H}(y(x))=-\int q(x)\log q(x)\,\text{d}x=-\int p(y(x))y^{\prime}(x)\log\left(p(y(x))y^{\prime}(x)\right)\text{d}x=\int\mathbb{G}(y^{\prime}(x),y(x))\,\text{d}x$$

Therefore, the information entropy can be regarded as a functional, which takes a function $y(x)$ as input and produces a real number as output.
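As a concrete instance of the change-of-variables formula $q(x)=p(y(x))y^{\prime}(x)$ and the entropy functional, the following sketch evaluates both numerically. It assumes, purely for illustration, a standard Gaussian $p$ and the Sigmoid activation, whose inverse is $y(x)=\log\frac{x}{1-x}$ on $(0,1)$; the function names and grid size are hypothetical choices.

```python
import math


def gaussian_pdf(t):
    """Assumed pre-activation density p: standard normal."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def inverse_sigmoid(x):
    """y(x): inverse of the Sigmoid activation, defined on (0, 1)."""
    return math.log(x / (1.0 - x))


def inverse_sigmoid_prime(x):
    """y'(x) = 1 / (x (1 - x))."""
    return 1.0 / (x * (1.0 - x))


def output_density(x):
    """q(x) = p(y(x)) * y'(x), the density after the activation."""
    return gaussian_pdf(inverse_sigmoid(x)) * inverse_sigmoid_prime(x)


def mass_and_entropy(n=200_000):
    """Midpoint-rule estimates of the total mass of q and of -int q log q dx."""
    dx = 1.0 / n
    mass = 0.0
    entropy = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        q = output_density(x)
        mass += q * dx
        if q > 0.0:  # skip points where q underflows to zero
            entropy -= q * math.log(q) * dx
    return mass, entropy


mass, entropy = mass_and_entropy()
print(f"total mass of q: {mass:.6f}")
print(f"entropy H(y):    {entropy:.6f}")
```

The total mass comes out close to one, confirming that $q$ is a valid density, and the entropy is negative, as it must be for any non-uniform density supported on $(0,1)$.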

4.2 Worst Activation Function with Boundary Condition (WAFBC)

First, we would like to determine the extremum (whether a maximum or a minimum) of the functional $\mathbb{H}(y(x))$. For the deductions that follow, we consider the simplest form of functional, i.e. we set $\mathbb{H}(y(x))=\int\mathbb{G}\left(y^{\prime}(x),y(x),x\right)\text{d}x$.

To study the influence of variations of the function $y(x)$, we apply a small perturbation $\varepsilon\eta(x)$ to $y(x)$, so that the functional $\mathbb{H}\left(y(x)+\varepsilon\eta(x)\right)$ takes the form:

$$\mathbb{H}\left(y(x)+\varepsilon\eta(x)\right)=\int\mathbb{G}\left(y^{\prime}(x)+\varepsilon\eta^{\prime}(x),\,y(x)+\varepsilon\eta(x),\,x\right)\text{d}x$$

Applying a Taylor expansion to $\mathbb{H}\left(y(x)+\varepsilon\eta(x)\right)$, we obtain:

$$\begin{split}\mathbb{H}\left(y(x)+\varepsilon\eta(x)\right)&=\int\left[\mathbb{G}(y^{\prime}(x),y(x),x)+\varepsilon\frac{\partial\mathbb{G}}{\partial y^{\prime}}\eta^{\prime}(x)+\varepsilon\frac{\partial\mathbb{G}}{\partial y}\eta(x)+\mathcal{O}(\varepsilon^{2})\right]\text{d}x\\&=\mathbb{H}(y(x))+\varepsilon\int\left[\frac{\partial\mathbb{G}}{\partial y}\eta(x)+\frac{\partial\mathbb{G}}{\partial y^{\prime}}\eta^{\prime}(x)\right]\text{d}x+\mathcal{O}(\varepsilon^{2})\end{split}\tag{1}$$

As stated in Section 4.1.2, $q(x)=p\left(y(x)\right)y^{\prime}(x)$ is the data distribution after passing through the activation function. For the inverse function $y(x)$ of the activation function, when $x$ approaches the lower bound (i.e. the value of the original activation function approaches its lower bound), $y(x)$ should approach negative infinity; and when $x$ approaches the upper bound (i.e. the value of the original activation function approaches its upper bound), $y(x)$ should approach positive infinity. Since $\varepsilon\eta(x)$ is a small perturbation applied to $y(x)$, we conclude that $\eta(x)$ must be 0 at the boundaries.

Applying integration by parts and this boundary condition to Equation 1, we derive:

$$\int\frac{\partial\mathbb{G}}{\partial y^{\prime}}\eta^{\prime}(x)\,\text{d}x=\int\frac{\partial\mathbb{G}}{\partial y^{\prime}}\,\text{d}\eta(x)=\eta(x)\frac{\partial\mathbb{G}}{\partial y^{\prime}}\bigg|_{x}-\int\eta(x)\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\text{d}x=-\int\eta(x)\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\text{d}x$$

Thus, $\mathbb{H}\left(y(x)+\varepsilon\eta(x)\right)$ has the following expression:

$$\mathbb{H}\left(y(x)+\varepsilon\eta(x)\right)=\mathbb{H}(y(x))+\varepsilon\int\left[\frac{\partial\mathbb{G}}{\partial y}-\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\right]\eta(x)\,\text{d}x+\mathcal{O}(\varepsilon^{2})$$

By analogy with the extremum of ordinary functions, the first-order term must vanish at the extremum. Requiring this for arbitrary $\eta(x)$ leads to the Euler-Lagrange equation:

$$\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)-\frac{\partial\mathbb{G}}{\partial y}=0\tag{2}$$

Proposition 1.

If $\mathbb{G}$ is independent of $x$, i.e. $\mathbb{G}=\mathbb{G}(y,y^{\prime})$, then the Euler-Lagrange equation in Equation 2 implies:

$$\mathbb{G}-y^{\prime}\frac{\partial\mathbb{G}}{\partial y^{\prime}}=C\tag{3}$$

where $C$ is a constant. A detailed proof of Proposition 1 can be found in Appendix A.
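For orientation, the standard textbook argument behind Proposition 1 (often called the Beltrami identity) can be sketched in a few lines. This is a generic derivation and may differ in presentation from the authors' proof in Appendix A. It differentiates the left-hand side of Equation 3 with respect to $x$, using the assumption that $\mathbb{G}$ has no explicit dependence on $x$, and then applies Equation 2:

```latex
\[
\begin{aligned}
\frac{\text{d}}{\text{d}x}\left(\mathbb{G}-y'\frac{\partial\mathbb{G}}{\partial y'}\right)
&=\frac{\partial\mathbb{G}}{\partial y}\,y'
 +\frac{\partial\mathbb{G}}{\partial y'}\,y''
 -y''\frac{\partial\mathbb{G}}{\partial y'}
 -y'\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y'}\right)\\
&=-y'\left[\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y'}\right)
 -\frac{\partial\mathbb{G}}{\partial y}\right]
 =0 .
\end{aligned}
\]
```

Since the derivative vanishes identically, the bracketed quantity $\mathbb{G}-y^{\prime}\,\partial\mathbb{G}/\partial y^{\prime}$ is constant in $x$, which is the statement of Equation 3.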

Substituting $\mathbb{G}=-p(y(x))y^{\prime}(x)\log\left(p(y(x))y^{\prime}(x)\right)$, the integrand of the entropy functional from Section 4.1.2, into Equation 3 and carrying out the calculation, we obtain:

$$\frac{\text{d}y}{\text{d}x}\,p(y(x))=C$$

Separating variables and integrating both sides of the equation, we obtain the solution:

$$x=c_{1}\int p(y)\,\text{d}y+c_{2}\tag{4}$$

Since $y(x)$ is the inverse function of the activation function, Equation 4 can be solved to obtain the form of the activation function:

$$f(x)=C_{1}\int_{-\infty}^{x}p(t)\,\text{d}t+C_{2}\tag{5}$$

where $C_{1}$ and $C_{2}$ are two constants determined by the upper bound and lower bound of the activation function.
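The intermediate steps from Equation 3 to Equation 5 can be written out as follows. This is a reconstruction under the assumption that $\mathbb{G}=-p(y)\,y^{\prime}\log\left(p(y)\,y^{\prime}\right)$, i.e. the integrand of the entropy functional defined in Section 4.1.2; the relabelling of constants in the last lines is ours.

```latex
\[
\begin{aligned}
\frac{\partial\mathbb{G}}{\partial y'} &= -p(y)\log\bigl(p(y)\,y'\bigr)-p(y),\\
\mathbb{G}-y'\frac{\partial\mathbb{G}}{\partial y'}
 &= -p(y)\,y'\log\bigl(p(y)\,y'\bigr)
    +p(y)\,y'\log\bigl(p(y)\,y'\bigr)+p(y)\,y'
  = p(y)\,y' = C,\\
p(y)\,\text{d}y &= C\,\text{d}x
 \;\Longrightarrow\;
 x = c_{1}\int p(y)\,\text{d}y + c_{2},
 \qquad c_{1}=\tfrac{1}{C},\\
x = f(y) \;&\Longrightarrow\;
 f(x) = C_{1}\int_{-\infty}^{x}p(t)\,\text{d}t + C_{2}.
\end{aligned}
\]
```

The last line only uses the fact that $y$ is the inverse of the activation function $f$, so the relation $x=f(y)$ gives $f$ directly once the argument is renamed.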

Equation 5 gives the analytical form of the worst activation function with boundary condition. We provide further discussion of this form in Appendix B. The derivation above determines an extremum of the functional; we next determine whether it is a maximum or a minimum. Applying the Legendre condition to the functional extremum, we have:

$$\mathbb{G}_{y^{\prime}y^{\prime}}=-\frac{p(y(x))}{y^{\prime}}\leqslant 0$$

The inequality holds because $p\geqslant 0$ and $y^{\prime}>0$ for a monotonically increasing activation function. Therefore, the derived extremum is a maximum, and in fact a global maximum, meaning that the deduced activation function has the worst performance. The WAFBC also possesses some intriguing properties. For example, it inherently has upper and lower bounds, which can explain why bounded activation functions like Sigmoid and Tanh do not perform as well as unbounded functions like ReLU.
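A quick way to see why Equation 5 maximizes entropy is to try it on samples. The sketch below assumes a standard Gaussian $p$, for which Equation 5 becomes a scaled and shifted Gaussian cumulative distribution function; with the hypothetical choice $C_1=1$, $C_2=0$ the outputs lie in $[0,1]$. By the probability integral transform the outputs are then uniformly distributed, and the uniform distribution has the largest entropy among distributions on a bounded interval. The function name, sample size, and bin count are illustrative assumptions.

```python
import math
import random


def wafbc_gaussian(x, c1=1.0, c2=0.0):
    """Equation 5 with p taken to be the standard normal density.

    The integral of p from -inf to x is the normal CDF, so the result is
    c1 * Phi(x) + c2. The defaults for c1 and c2 are illustrative only.
    """
    return c1 * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) + c2


rng = random.Random(0)
pre_activation = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
post_activation = [wafbc_gaussian(v) for v in pre_activation]

# Histogram of the outputs over 20 equal-width bins on [0, 1].
num_bins = 20
counts = [0] * num_bins
for v in post_activation:
    counts[min(int(v * num_bins), num_bins - 1)] += 1
frequencies = [c / len(post_activation) for c in counts]

largest_gap = max(abs(f - 1.0 / num_bins) for f in frequencies)
print(f"largest deviation from a flat histogram: {largest_gap:.5f}")
```

Every bin receives close to the same share of samples, so the activation has spread the data as evenly as a bounded output range allows.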

4.3 Entropy-based Activation Function Optimization (EAFO)

In Section 4.2, we derived the extremum of the functional and gave its analytical form in Equation 5. However, the solution obtained is the global maximum rather than the minimum, whereas the minimum of the functional is what we would need in order to obtain the best activation function. Calculation shows that this functional has a global maximum but no global minimum. Hence, there is no best activation function, only better ones. In this scenario, WAFBC represents the global maximum of the functional, indicating that the performance of activation functions consistently improves when moving from WAFBC to any alternative activation function. We therefore pose the following question: is there a methodology that starts from an existing, high-performing activation function and develops an activation function with superior performance?

Let us reconsider the Taylor expansion of the functional:

$$\mathbb{H}(y(x)+\varepsilon\eta(x))=\mathbb{H}(y(x))+\varepsilon\int\left[\frac{\partial\mathbb{G}}{\partial y}-\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\right]\eta(x)\,\text{d}x+\mathcal{O}(\varepsilon^{2})$$

To reduce the information entropy of the new activation function, the first-order term of the Taylor expansion should be made negative. To ensure that the information entropy is indeed reduced, we choose $\eta(x)$ to have the opposite sign to $\frac{\partial\mathbb{G}}{\partial y}-\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)$, i.e. we set:

$$\eta(x)=-\left(\frac{\partial\mathbb{G}}{\partial y}-\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\right)\tag{6}$$

With this choice, the first-order term equals $-\varepsilon\int\eta(x)^{2}\,\text{d}x$, which is non-positive for $\varepsilon>0$.

Substituting the analytical form of the integrand $\mathbb{G}(y^{\prime}(x),y(x))$ into Equation 6 and carrying out the calculation, we derive:

$$\eta(x)=-\left(p(y(x))\frac{y^{\prime\prime}(x)}{y^{\prime}(x)}+p^{\prime}(y(x))y^{\prime}(x)\right)\tag{7}$$

where $p(x)$ is the probability density function (PDF) of the data distribution before passing through the activation function; $p^{\prime}(x)$ is the first-order derivative of the PDF; $y(x)$ is the inverse function of the activation function; $y^{\prime}(x)$ is the first-order derivative of $y(x)$; and $y^{\prime\prime}(x)$ is the second-order derivative of $y(x)$.
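The calculation that leads from Equation 6 to Equation 7 can be spelled out as follows. This is a reconstruction under the assumption that $\mathbb{G}=-p(y)\,y^{\prime}\log\left(p(y)\,y^{\prime}\right)$, the integrand of the entropy functional from Section 4.1.2, with $p^{\prime}$ denoting the derivative of $p$ with respect to its argument.

```latex
\[
\begin{aligned}
\frac{\partial\mathbb{G}}{\partial y}
 &= -p'(y)\,y'\bigl[\log\bigl(p(y)\,y'\bigr)+1\bigr],\\
\frac{\partial\mathbb{G}}{\partial y'}
 &= -p(y)\bigl[\log\bigl(p(y)\,y'\bigr)+1\bigr],\\
\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y'}\right)
 &= -p'(y)\,y'\bigl[\log\bigl(p(y)\,y'\bigr)+1\bigr]
    -p(y)\left[\frac{p'(y)\,y'}{p(y)}+\frac{y''}{y'}\right],\\
\frac{\partial\mathbb{G}}{\partial y}
 -\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y'}\right)
 &= p(y)\,\frac{y''}{y'}+p'(y)\,y',\\
\eta(x)
 &= -\left(p(y)\,\frac{y''}{y'}+p'(y)\,y'\right).
\end{aligned}
\]
```

The logarithmic terms cancel between the first and third lines, which is why the correction term depends only on $p$, $p^{\prime}$, and the first two derivatives of $y$.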

As a result, we have derived a correction term capable of decreasing the information entropy, with its general form given in Equation 7. We can then obtain the inverse function of the optimized activation function, denoted $g(x)=y(x)+\eta(x)$. Finally, the optimized activation function is obtained by inverting $g(x)$.

EAFO methodology outline. In summary, the theoretical EAFO methodology proceeds as follows: 1) Use Equation 7 to derive the correction term $\eta(x)$, given the data distribution $p(y)$ and the inverse function $y(x)$ of the activation function. 2) Add the correction term to the inverse function to obtain the inverse function of the optimized activation function, i.e. $g(x)=y(x)+\eta(x)$. 3) Derive the exact or approximate inverse function of $g(x)$, which yields the optimized activation function.
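The three steps of the outline can be walked through numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it takes $p$ to be a standard Gaussian, uses the identity $y(x)=x$ (the inverse of the positive part of ReLU), scales the correction by a hypothetical small constant `EPS` that mirrors the $\varepsilon$ in the perturbation $\varepsilon\eta(x)$, and inverts $g$ by bisection rather than in closed form. All names are hypothetical.

```python
import math

EPS = 0.1  # hypothetical scale of the correction; not a value from the paper


def gaussian_pdf(t):
    """Assumed data density p: standard normal."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def gaussian_pdf_prime(t):
    """p'(t) = -t * p(t) for the standard normal density."""
    return -t * gaussian_pdf(t)


def correction_term(x):
    """Step 1: Equation 7 with y(x) = x, so y' = 1 and y'' = 0."""
    y, y1, y2 = x, 1.0, 0.0
    return -(gaussian_pdf(y) * y2 / y1 + gaussian_pdf_prime(y) * y1)


def optimized_inverse(x):
    """Step 2: g(x) = y(x) + EPS * eta(x), the inverse of the new activation."""
    return x + EPS * correction_term(x)


def optimized_activation(v, lo=0.0, hi=50.0, iters=80):
    """Step 3: invert g on [lo, hi] by bisection (g is increasing for this EPS)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if optimized_inverse(mid) < v:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for v in (0.1, 0.5, 1.0, 2.0, 5.0):
    out = optimized_activation(v)
    print(f"input={v:4.1f}  optimized activation={out:.6f}")
```

Under these assumptions the correction term reduces to $\eta(x)=x\,p(x)$, and the numerically inverted activation stays slightly below the identity for positive inputs. Bisection stands in for step 3 only because it needs no closed-form inverse; an analytical approximation would serve the same role.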

Furthermore, the EAFO methodology also shows potential for dynamically optimizing activations during iterative training. It is well known that the activation functions of neural networks with a Multi-Layer Perceptron (MLP) architecture are typically fixed. Recent studies, such as the work of Liu et al. [27], have suggested optimizing activations in innovative network architectures (Kolmogorov-Arnold Networks). For true data distributions $p(y)$, the EAFO methodology may allow us to optimize the activation $y(x)$ continuously under an MLP architecture using numerical methods. In theory, it is also feasible to optimize activation functions by numerical gradient descent on the information entropy functional; however, this would cause an explosion of computational complexity in large neural networks, which calls for practically efficient algorithms. Hence, the EAFO methodology is presently still at the theoretical stage, providing guidance for calculating the analytical form of better activation functions.

4.4 Correction Regularized ReLU (CRReLU): From ReLU to Better

As illustrated in Section 4.2, the worst activation function exists in theory, and we can determine its exact form. Beginning with the worst activation function, the value of the functional $\mathbb{G}$ consistently decreases, indicating an improvement in the performance of the activation function. This reveals the feasibility of searching for an improved activation function, which constitutes the crux of "optimization". In Section 4.3, EAFO is proposed as the optimization methodology. Hence, a natural idea is to optimize from WAFBC to obtain a better-performing activation function. While such an idea is feasible, WAFBC itself takes the form of a variable upper bound integral, which yields a complex form of $\eta(x)$ and renders the deduced result impractical. Moreover, commencing optimization from WAFBC also leads to slow progress. Therefore, in practical applications, we are inclined to start from an activation function that already demonstrates relatively good performance.

Here, we take ReLU [1, 2, 3] as the starting point and show the process of finding a better activation function. Before the deduction, we note that ReLU lacks an inverse function over the entire domain. In this section, we adopt the following strategies to mitigate this difficulty: the initial activation function only needs an inverse function in the specific regions where it is required; and for parts without an inverse function, we may employ practical approximations. Therefore, we first examine the region where $x$ is positive in the case of ReLU. As shown in Equation 7, the derivation of the correction term $\eta(x)$ only requires the original distribution $p(y)$ and the inverse function of the activation function $y(x)$. The activation function is known, whereas the original distribution is not. Moreover, in real experiments, the original distribution of experimental data would surely exhibit substantial morphological variability and thus lacks a perfect analytical form. Hence, we assume that networks are large enough that, according to the Central Limit Theorem, the data processed by them can be approximated as a Gaussian distribution [28, 29, 30, 31, 32]. Certainly, such an assumption may not always hold in networks of real experiments; nevertheless, the approximation of the exact solution for the inverse function and the existence of the learnable parameter $\epsilon$ significantly mitigate the impact of this assumption, as also demonstrated by the insensitivity of CRReLU to data distribution shown in Section 5.

Now, let’s consider the derivation from ReLU to CRReLU. For the sake of concise representation, we rewrite the data distribution and the derivative of data distribution as:

$$p(y)=C\cdot e^{-\frac{y^{2}}{2}},\quad p^{\prime}(y)=-C\cdot ye^{-\frac{y^{2}}{2}}$$

Furthermore, ReLU is defined as $y=x$ when $x$ is positive, meaning we have $y(x)=x$, $y^{\prime}(x)=1$ and $y^{\prime\prime}(x)=0$. Therefore,

$$p^{\prime}(y(x))=p^{\prime}(x)=-C\cdot ye^{-\frac{y^{2}}{2}}=-C\cdot xe^{-\frac{x^{2}}{2}}$$

Ultimately, by incorporating $p^{\prime}(y)=-C\cdot xe^{-\frac{x^{2}}{2}}$, $y^{\prime}(x)=1$ and $y^{\prime\prime}(x)=0$ into Equation 7, we can obtain:

$$\eta(x)=-C\cdot xe^{-\frac{x^{2}}{2}}$$

Furthermore, we turn the constant $C$ into a learnable parameter $\epsilon$ with the purpose of enabling self-optimization in networks. According to the EAFO methodology, we obtain the inverse function of the revised activation function as follows:

$$g(x)=x-\epsilon xe^{-\frac{x^{2}}{2}}\quad\quad x\geqslant 0 \tag{8}$$

Finally, the optimized activation function CRReLU can be obtained by deriving the inverse function of $g(x)$. However, inverting Equation 8 in closed form is challenging with conventional methods; as a consequence, we use the following function as a practical approximation:

$$f(x)=x+\epsilon xe^{-\frac{x^{2}}{2}}\quad\quad x\geqslant 0 \tag{9}$$

We justify the use of Equation 9 as the approximate inverse function of Equation 8, and show its reliability, in Proposition 2.

Proposition 2.

Let $g(x)=x-\epsilon xe^{-\frac{x^{2}}{2}}$ and $f(x)=x+\epsilon xe^{-\frac{x^{2}}{2}}$. For $x\geqslant 0$, the absolute value of the error between $g\left(f(x)\right)$ and $x$ is bounded by $\left|e^{-1}\epsilon^{2}+0.5e^{-\frac{3}{2}}\epsilon^{3}\right|$.

Detailed proof of Proposition 2 can be seen in Appendix C.
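The bound in Proposition 2 can also be checked numerically. The following is a rough sanity check rather than a proof: it evaluates $|g(f(x))-x|$ on a finite grid and compares the largest value with the stated bound. The grid range `x_max` and resolution `steps` are arbitrary choices made for this sketch.

```python
import math


def f(x: float, eps: float) -> float:
    """Approximate inverse from Equation 9."""
    return x + eps * x * math.exp(-x * x / 2.0)


def g(x: float, eps: float) -> float:
    """Inverse of the optimized activation from Equation 8."""
    return x - eps * x * math.exp(-x * x / 2.0)


def bound(eps: float) -> float:
    """Bound stated in Proposition 2."""
    return abs(math.exp(-1.0) * eps ** 2 + 0.5 * math.exp(-1.5) * eps ** 3)


def max_roundtrip_error(eps: float, x_max: float = 10.0,
                        steps: int = 20_000) -> float:
    """Largest |g(f(x)) - x| over an evenly spaced grid on [0, x_max]."""
    worst = 0.0
    for i in range(steps + 1):
        x = x_max * i / steps
        worst = max(worst, abs(g(f(x, eps), eps) - x))
    return worst


for eps in (0.01, 0.1):
    print(eps, max_roundtrip_error(eps), bound(eps))
```

For $\epsilon=0.01$ the bound evaluates to roughly $3.7\times10^{-5}$, so the round trip through $f$ and $g$ changes the input only marginally at the initialization used in Section 5.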

As illustrated in Section 4.2, $\epsilon\eta(x)$ is a small perturbation; hence, from a theoretical perspective, we can treat $\epsilon\eta(x)$ as an infinitesimal. Furthermore, in this case, given that $\eta(x)$ is a bounded function, we can deduce that $\epsilon$ is also an infinitesimal. Therefore, the absolute value of the error between $g\left(f(x)\right)$ and $x$ is an infinitesimal of higher order. In practice, we typically initialize $\epsilon$ to a small value, such as 0.01 (as described in Section 5), so the absolute error is small.

Finally, let’s consider the part where $x$ is negative. When $x$ is negative, the inverse function of ReLU can be visualized as a ray emanating from the origin and extending to infinity with an infinite slope; when $x$ is positive, it is a ray with a slope of 1. Hence, the correction term solution for positive and negative values of $x$ can be considered identical, differing only by the constant $C$. Equation 9 and Proposition 2 show that incorporating the correction term into a linear activation function can have beneficial effects by reducing the information entropy. Therefore, we obtain the full form of Correction Regularized ReLU as:

$$f(x)=\max(0,x)+\epsilon xe^{-\frac{x^{2}}{2}} \tag{10}$$
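Equation 10 translates directly into code. The snippet below is a minimal scalar sketch using only the Python standard library; it is not the pseudocode of Appendix D.1. Here `eps` is passed as a plain argument, whereas in the paper $\epsilon$ is a learnable parameter, so a framework implementation would register it as a trainable tensor instead.

```python
import math


def crrelu(x: float, eps: float = 0.01) -> float:
    """CRReLU from Equation 10: max(0, x) + eps * x * exp(-x^2 / 2)."""
    return max(0.0, x) + eps * x * math.exp(-x * x / 2.0)


# The correction term vanishes for large |x|, so CRReLU approaches ReLU there,
# while small negative inputs produce a small negative output.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, crrelu(x))
```

Setting `eps=0.0` recovers plain ReLU, which makes the role of the correction term easy to isolate in experiments.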

Discussion on the introduced learnable parameter $\epsilon$. In Section 4.2, we demonstrated the existence of the worst activation function; starting from the worst, any direction taken leads to improvement. However, commencing from a specific activation function, like ReLU here, does not invariably result in improvement across all directions, i.e. certain optimization paths may lead to deteriorated outcomes. Therefore, from the practical perspective, we introduce the learnable parameter $\epsilon$ with the aim of enabling self-optimization of networks. From another perspective, in the derivation from ReLU to CRReLU, we assume that data follows a Gaussian distribution, which might not be true in real experiments. The existence of the learnable parameter $\epsilon$ also weakens this assumption to some extent.

Finally, we provide further details of CRReLU in Appendix D, including python-like pseudocode of CRReLU in Appendix D.1, and further discussion on properties of CRReLU in Appendix D.2.

5 Experiments

Datasets. In experiments of image classification task, we adopt three datasets, ordered as CIFAR-10 [7], CIFAR-100 [7] and ImageNet-1K [8] in terms of the number of classification categories. In experiments of large language model (LLM) fine-tuning task, we employ two human preference datasets: SHP [33] and HH [34].

Baselines. We conduct experiments comparing the performance of CRReLU with several typical existing corrections of ReLU as illustrated in Section 2 and Section 3 : PReLU [19], ELU [20], CELU [21], GELU [24], Swish (SiLU) [22] and Mish [23].

Experimental hyperparameters. For all transformer-based architectures, we directly set $\epsilon$ to 0.01 without further optimization. Detailed experimental hyperparameters are provided in Appendix E.

5.1 Task of Image Classification

We conduct all experiments on 4× RTX 3090 GPUs for 100 epochs using the AdamW optimizer with a weight decay of 0.05, a gradient clipping norm of 1.0, the cross entropy loss function, and a cosine annealing learning rate scheduler with linear warm-up.
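The learning rate schedule described above can be sketched as a plain function of the epoch index. Only the 100-epoch horizon comes from the text; the base learning rate, the warm-up length, and the minimum learning rate below are hypothetical placeholders, since the actual values are deferred to Appendix E.

```python
import math


def lr_at_epoch(epoch: int, base_lr: float = 1e-3, warmup_epochs: int = 5,
                total_epochs: int = 100, min_lr: float = 0.0) -> float:
    """Linear warm-up followed by cosine annealing (all defaults are assumed)."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the annealing phase completed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(100)]
```

In a training loop, this value would be assigned to the optimizer before each epoch, with the weight decay of 0.05 and the clipping norm of 1.0 configured separately.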

Experiments of ViTs on CIFAR-10 and CIFAR-100. Vision Transformer and its variants possess a sufficiently complex structure and representational capability, and have garnered widespread attention from the community. Moreover, the assumption of Gaussian distribution has been theoretically proved reasonable for sufficiently large MLPs [28, 29, 30, 31] and CNNs [32]; however, the distribution of data under the attention mechanism of transformers remains unexplored. Hence, we select vision transformer and its variants as our test models in order to further investigate the insensitivity of CRReLU to data distribution. For the experiments on CIFAR-10 and CIFAR-100, we select Vision Transformer (ViT) [4], Data-Efficient Image Transformer (DeiT) [5] and Transformer in Transformer (TNT) [6]. We report the top-one accuracy on CIFAR-10 in Table 1 and on CIFAR-100 in Table 2, which shows that CRReLU outperforms the other existing corrections of ReLU on the CIFAR datasets.

Table 1: Test accuracy of experiments conducted on CIFAR-10 for 100 epochs.

| Top-one Accuracy | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| CIFAR-10 ViT-Tiny | 0.706 | 0.669 | 0.786 | 0.669 | 0.683 | 0.687 | 0.802 |
| CIFAR-10 DeiT-Tiny | 0.716 | 0.671 | 0.753 | 0.671 | 0.694 | 0.695 | 0.768 |
| CIFAR-10 TNT-Small | 0.743 | 0.689 | 0.761 | 0.689 | 0.719 | 0.725 | 0.775 |

Table 2: Test accuracy of experiments conducted on CIFAR-100 for 100 epochs.

| Top-one Accuracy | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| CIFAR-100 ViT-Tiny | 0.322 | 0.287 | 0.421 | 0.287 | 0.306 | 0.297 | 0.459 |
| CIFAR-100 DeiT-Tiny | 0.460 | 0.400 | 0.493 | 0.400 | 0.429 | 0.429 | 0.508 |
| CIFAR-100 TNT-Small | 0.484 | 0.435 | 0.498 | 0.435 | 0.459 | 0.464 | 0.508 |

Experiments of ViTs on ImageNet-1K. The ImageNet-1K dataset poses a significant challenge to the information processing capability of neural networks due to its large image size and extensive range of classification categories. For the experiments on ImageNet-1K, we select Vision Transformer (ViT) [4] and Data-Efficient Image Transformer (DeiT) [5]. We report the top-one accuracy on ImageNet-1K in Table 3.

Table 3: Test accuracy of experiments conducted on ImageNet-1K for 100 epochs.

| Top-one Accuracy | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| ImageNet-1K ViT-Tiny | 0.542 | 0.384 | 0.572 | 0.384 | 0.469 | 0.479 | 0.579 |
| ImageNet-1K DeiT-Tiny | 0.619 | 0.497 | 0.612 | 0.497 | 0.584 | 0.592 | 0.615 |

Experiments on ViT clearly demonstrate the superiority of CRReLU over the other activation functions, whereas on DeiT, GELU achieves 0.4 percentage points higher accuracy than CRReLU. We attribute this result to the teacher-student structure of the DeiT model. We use the fine-tuned "deit-tiny-patch16-224" model, which was trained with GELU, as the teacher model. As explained in [35], transformers inherit inductive bias through distillation. Hence, a student model trained with GELU on ImageNet-1K, guided by a teacher model that has already been pre-trained on ImageNet-1K with GELU, is expected to achieve better results than with other activation functions.

5.2 Task of Large Language Model (LLM) Fine-tuning

In order to further validate the effectiveness of CRReLU on larger networks and its generalization to a richer range of applications, we perform supplementary experiments on the LLM fine-tuning task. We employ the Direct Preference Optimization (DPO) [9] method to fine-tune GPT-2 [36] on the Stanford Human Preferences (SHP) dataset [33] and the Anthropic HH dataset [34]. GPT-2 has 137 M parameters, a relatively modest size; hence we conduct full fine-tuning rather than LoRA-based fine-tuning, on 2× RTX 3090 GPUs. Firstly, we carry out supervised fine-tuning (SFT) with the purpose of mitigating the distribution shift between the true reference distribution, which is unavailable, and the reference policy utilized by DPO. Subsequently, we set the penalty coefficient $\beta$ to 0.1, 1, 2, and 5 in separate runs, in order to compare the performance of CRReLU and GELU under different penalty coefficients, and then execute DPO. We report evaluation metrics of the fine-tuning process in Table 4, which shows that CRReLU generally outperforms GELU in the LLM fine-tuning task.

Table 4: Metrics comparison between CRReLU and GELU in the task of LLM fine-tuning.

| $\beta$ | Activation | Evaluation Margin Reward ↑ | Evaluation Accuracy ↑ | Evaluation Loss ↓ |
|---|---|---|---|---|
| 0.1 | CRReLU | 0.1428 | 0.6210 | 0.6476 |
| 0.1 | GELU | 0.1419 | 0.6196 | 0.6480 |
| 1 | CRReLU | 0.4626 | 0.5756 | 0.9201 |
| 1 | GELU | 0.4556 | 0.5731 | 0.9298 |
| 2 | CRReLU | 0.7736 | 0.5628 | 1.462 |
| 2 | GELU | 0.7176 | 0.5606 | 1.481 |
| 5 | CRReLU | 1.846 | 0.5635 | 3.268 |
| 5 | GELU | 1.651 | 0.5566 | 3.305 |
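To make the role of the penalty coefficient $\beta$ concrete, the sketch below computes the standard DPO quantities for a single preference pair from sequence log-probabilities. It assumes that the reported margin reward, accuracy, and loss follow the usual DPO definitions (implicit reward difference, fraction of pairs with positive margin, and negative log-sigmoid of the margin); the function name and argument names are hypothetical, and the paper's exact evaluation code is not reproduced here.

```python
import math


def dpo_pair_metrics(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1):
    """Return (margin reward, loss, correct) for one preference pair."""
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return margin, loss, margin > 0.0


margin, loss, correct = dpo_pair_metrics(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

Because the margin scales linearly with $\beta$ for fixed log-probabilities, margins and losses obtained under different values of $\beta$ are not directly comparable across rows of Table 4, only between CRReLU and GELU at the same $\beta$.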

6 Discussion

Pursuit of better activation functions has been a longstanding and fundamental topic in the realm of machine learning. However, prior research has consistently concentrated on empirical search, without an emphasis on understanding the underlying mathematical mechanisms. This work aims to offer a proper solution to such issue. Our investigation into the relationship between activation functions and information theory concepts reveals that information entropy can be represented as a functional. Existence of the worst activation function with boundary condition (WAFBC) furnishes a solid theoretical basis for exploring better activation functions. In the process of solving WAFBC, we draw inspiration from the Taylor expansion form, leading us to propose Entropy-based Activation Function Optimization (EAFO) methodology. EAFO methodology presents a novel perspective for designing static activation functions in deep neural networks and shows the potential of dynamically optimizing activation during iterative training. Utilizing EAFO methodology, we derive a novel activation function from ReLU, called Correction Regularized ReLU (CRReLU). Experiments involving image classification task and large language model (LLM) fine-tuning task demonstrate that CRReLU is comparable to or surpasses existing corrections of ReLU. Overall, the EAFO methodology provides numerous promising avenues for future research on activation functions, and the CRReLU introduces a novel addition to the set of high-performing activation functions.

Limitations and Future Work. Our findings raise several important questions for future work. Firstly, how can EAFO framework be systematically generalized to non-invertible activation functions? In the initial setting of EAFO methodology, the choice of activation function is restricted to those with invertible counterparts. Despite ReLU being a prominent example of activation function without an inverse, we derive CRReLU utilizing EAFO; however, the derivation also partly benefits from the simplicity of ReLU’s form and several heuristic approaches. Secondly, how to effectively implement activation function iteration optimization during neural network training? Notwithstanding the demonstrated feasibility of iterative activation function optimization during neural network training, it is currently hindered by the high computational complexity, particularly in large-scale neural networks. Applicability of the EAFO methodology to optimize activation in alternative network structures, such as Kolmogorov-Arnold Networks (KANs), also deserves further in-depth research. Therefore, the development of practical and efficient algorithms is an exciting direction for future work. Finally, while we have empirically validated the exceptional performance of CRReLU on image classification task and large language model fine-tuning task, its performance on other tasks remains to be explored, thereby warranting further investigation.

References

Appendix A Proof of Proposition 1

Proof.

From Equation 2, we know that:

$$\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)-\frac{\partial\mathbb{G}}{\partial y}=0$$

Considering the total differential of $\mathbb{G}$:

$$\frac{\text{d}\mathbb{G}}{\text{d}x}\left(y^{\prime},y,x\right)=\frac{\partial\mathbb{G}}{\partial x}\cdot\frac{\text{d}x}{\text{d}x}+\frac{\partial\mathbb{G}}{\partial y}\cdot\frac{\text{d}y}{\text{d}x}+\frac{\partial\mathbb{G}}{\partial y^{\prime}}\cdot\frac{\text{d}y^{\prime}}{\text{d}x}=\frac{\partial\mathbb{G}}{\partial x}+\frac{\partial\mathbb{G}}{\partial y}\cdot y^{\prime}+\frac{\partial\mathbb{G}}{\partial y^{\prime}}\cdot y^{\prime\prime}$$

Thus, we have:

$$\begin{split}\frac{\text{d}}{\text{d}x}\left(y^{\prime}\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)&=y^{\prime\prime}\frac{\partial\mathbb{G}}{\partial y^{\prime}}+y^{\prime}\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\\ &=\frac{\text{d}\mathbb{G}}{\text{d}x}\left(y^{\prime},y,x\right)-\frac{\partial\mathbb{G}}{\partial y}\cdot y^{\prime}-\frac{\partial\mathbb{G}}{\partial x}+y^{\prime}\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\\ &=\frac{\text{d}}{\text{d}x}\mathbb{G}\left(y^{\prime},y,x\right)-\frac{\partial\mathbb{G}}{\partial x}-y^{\prime}\cdot\left(\frac{\partial\mathbb{G}}{\partial y}-\frac{\text{d}}{\text{d}x}\left(\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)\right)\\ &=\frac{\text{d}}{\text{d}x}\mathbb{G}\left(y^{\prime},y,x\right)-\frac{\partial\mathbb{G}}{\partial x}\end{split}$$

Therefore, we know that

$$\frac{\partial\mathbb{G}}{\partial x}-\frac{\text{d}}{\text{d}x}\left(\mathbb{G}-y^{\prime}\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)=0$$

Since $\mathbb{G}$ does not depend explicitly on $x$, we have $\frac{\partial\mathbb{G}}{\partial x}=0$. Hence,

$$\frac{\text{d}}{\text{d}x}\left(\mathbb{G}-y^{\prime}\frac{\partial\mathbb{G}}{\partial y^{\prime}}\right)=0$$

Finally, we can draw the conclusion that:

$$\mathbb{G}-y^{\prime}\frac{\partial\mathbb{G}}{\partial y^{\prime}}=C$$

, which completes the proof. ∎

Appendix B Further Discussion on WAFBC

Let’s take several typical boundary conditions into consideration. Firstly, suppose $f(x)$ approaches 1 as $x$ tends to positive infinity, and $f(x)$ approaches 0 as $x$ tends to negative infinity. Then the solution takes the form of the cumulative distribution function (CDF), which can be expressed as:

$$f(x)=\int_{-\infty}^{x}p(t)\,\text{d}t$$

Similarly, if we fix the difference between the upper and lower bounds of the activation function to be $e$ and make the activation function symmetric about the origin, the form can be written as:

$$f(x)=e\int_{0}^{x}p(t)\,\text{d}t$$

Furthermore, if the input data distribution is assumed to be approximately uniform, the worst activation function can be approximated as a linear function. If the input data distribution is instead approximated as a normal distribution, the form of the worst activation function would be closer to Sigmoid and Tanh. We show the comparison of function curves in Figure 1 and Figure 2.

Figure 1: Comparison between Sigmoid and standard normal CDF

Figure 2: Comparison between Tanh and standard normal CDF multiplied by $e$ (transformed to achieve symmetry about the origin)
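The two comparisons in the figures can be reproduced numerically for a standard normal $p(t)$. The sketch below is an illustration under that assumption: the CDF is written through the error function, and the symmetric variant uses the identity $\int_{0}^{x}p(t)\,\text{d}t=\Phi(x)-\frac{1}{2}$. The function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, i.e. the WAFBC with limits 0 and 1."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def wafbc_symmetric(x: float) -> float:
    """e * integral_0^x p(t) dt for standard normal p, with limits -e/2 and e/2."""
    return math.e * (normal_cdf(x) - 0.5)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Print the curves side by side at a few sample points.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:+.1f}  cdf={normal_cdf(x):.4f}  sigmoid={sigmoid(x):.4f}  "
          f"e*int={wafbc_symmetric(x):+.4f}  tanh={math.tanh(x):+.4f}")
```

Note that the symmetric variant saturates at $\pm e/2$ rather than $\pm 1$, so it resembles Tanh in shape while differing in scale.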

Appendix C Proof of Proposition 2

Before the proof of Proposition 2, we would like to show four facts without proof.

Fact 1.

$f(x)=xe^{-\frac{x^{2}}{2}}$ is a bounded function, and the range of the function is $[-e^{-\frac{1}{2}},e^{-\frac{1}{2}}]$.

Fact 2.

$f(x)=x^{2}e^{-x^{2}}$ is a bounded function, and the range of the function is $[0,e^{-1}]$.

Fact 3.

$f(x)=x^{3}e^{-\frac{3}{2}x^{2}}$ is a bounded function, and the range of the function is $[-e^{-\frac{3}{2}},e^{-\frac{3}{2}}]$.

Fact 4.

$\forall x\in\mathbb{R}$, $1-e^{-x}-x\leqslant 0$.
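Although the four facts are stated without proof, each follows from elementary calculus: the first three functions attain their extrema at $x=\pm 1$, and the fourth is the inequality $e^{-x}\geqslant 1-x$. A quick grid evaluation can serve as a sanity check; the interval $[-10,10]$ and the grid density below are arbitrary choices for this sketch.

```python
import math


def grid(lo: float = -10.0, hi: float = 10.0, n: int = 20_000):
    """Evenly spaced sample points on [lo, hi] (contains x = -1, 0, 1 exactly)."""
    return [lo + (hi - lo) * i / n for i in range(n + 1)]


def fact1(x: float) -> float:
    return x * math.exp(-x * x / 2.0)


def fact2(x: float) -> float:
    return x * x * math.exp(-x * x)


def fact3(x: float) -> float:
    return x ** 3 * math.exp(-1.5 * x * x)


def fact4(x: float) -> float:
    return 1.0 - math.exp(-x) - x


xs = grid()
sup1 = max(abs(fact1(x)) for x in xs)  # expected to match exp(-1/2)
sup2 = max(fact2(x) for x in xs)       # expected to match exp(-1)
sup3 = max(abs(fact3(x)) for x in xs)  # expected to match exp(-3/2)
max4 = max(fact4(x) for x in xs)       # expected to be non-positive
```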

We now commence the proof of Proposition 2.

Proof.

Substituting the analytic expression into the formula and performing algebraic simplifications, we can obtain:

$$
\begin{split}
g\left(f(x)\right)=g\left(x+\epsilon xe^{-\frac{x^{2}}{2}}\right)&=x+\epsilon xe^{-\frac{x^{2}}{2}}-\epsilon\left(x+\epsilon xe^{-\frac{x^{2}}{2}}\right)e^{-\frac{1}{2}\left(x+\epsilon xe^{-\frac{x^{2}}{2}}\right)^{2}}\\
&=x+\epsilon x\left(e^{-\frac{x^{2}}{2}}-e^{-\frac{1}{2}\left(x+\epsilon xe^{-\frac{x^{2}}{2}}\right)^{2}}\right)-\epsilon^{2}xe^{-\frac{x^{2}}{2}}e^{-\frac{1}{2}\left(x+\epsilon xe^{-\frac{x^{2}}{2}}\right)^{2}}\\
&=x+\epsilon xe^{-\frac{x^{2}}{2}}\left[1-e^{-\frac{1}{2}\left(2\epsilon xe^{-\frac{x^{2}}{2}}+\epsilon^{2}x^{2}e^{-x^{2}}\right)}\right]-\epsilon^{2}xe^{-\frac{x^{2}}{2}}e^{-\frac{1}{2}\left(x+\epsilon xe^{-\frac{x^{2}}{2}}\right)^{2}}
\end{split}
$$

Thus,

$$
\begin{split}
\left|g(f(x))-x\right|&=\left|\epsilon xe^{-\frac{x^{2}}{2}}\left[1-e^{-\frac{1}{2}\left(2\epsilon xe^{-\frac{x^{2}}{2}}+\epsilon^{2}x^{2}e^{-x^{2}}\right)}\right]-\epsilon^{2}xe^{-\frac{x^{2}}{2}}e^{-\frac{1}{2}\left(x+\epsilon xe^{-\frac{x^{2}}{2}}\right)^{2}}\right|\\
&\leqslant\left|\epsilon xe^{-\frac{x^{2}}{2}}\left[1-e^{-\frac{1}{2}\left(2\epsilon xe^{-\frac{x^{2}}{2}}+\epsilon^{2}x^{2}e^{-x^{2}}\right)}\right]\right|\\
&\leqslant\left|\epsilon xe^{-\frac{x^{2}}{2}}\left[-\frac{2\epsilon xe^{-\frac{x^{2}}{2}}+\epsilon^{2}x^{2}e^{-x^{2}}}{2}\right]\right|\\
&=\left|\epsilon xe^{-\frac{x^{2}}{2}}\left(-\epsilon xe^{-\frac{x^{2}}{2}}-\frac{1}{2}\epsilon^{2}x^{2}e^{-x^{2}}\right)\right|=\left|\epsilon^{2}x^{2}e^{-x^{2}}+\frac{1}{2}\epsilon^{3}x^{3}e^{-\frac{3}{2}x^{2}}\right|\\
&\leqslant\left|e^{-1}\epsilon^{2}+0.5e^{-\frac{3}{2}}\epsilon^{3}\right|
\end{split}
$$

The first inequality holds by Fact 1 together with the observation that, when $x$ is positive, the second term inside the absolute value must be positive. The second inequality follows from Fact 4, and the third from Fact 2 and Fact 3. Hence, the absolute error between $g\left(f(x)\right)$ and $x$ is bounded by $\left|e^{-1}\epsilon^{2}+0.5e^{-\frac{3}{2}}\epsilon^{3}\right|$, which completes the proof.
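As a numerical sanity check of this bound (an illustration, not part of the proof), the sketch below evaluates $\left|g(f(x))-x\right|$ on a grid and compares the largest value with $\left|e^{-1}\epsilon^{2}+0.5e^{-\frac{3}{2}}\epsilon^{3}\right|$. It assumes $f(x)=x+\epsilon xe^{-\frac{x^{2}}{2}}$ and $g(y)=y-\epsilon ye^{-\frac{y^{2}}{2}}$, as read off from the expansion above; the grid range, grid resolution, and sample values of $\epsilon$ are arbitrary choices made here.

```python
import math


def f(x, eps):
    """f(x) = x + eps * x * exp(-x^2 / 2)."""
    return x + eps * x * math.exp(-x * x / 2)


def g(y, eps):
    """g(y) = y - eps * y * exp(-y^2 / 2), as used in the expansion."""
    return y - eps * y * math.exp(-y * y / 2)


def bound(eps):
    """Right-hand side of the final inequality."""
    return abs(math.exp(-1) * eps**2 + 0.5 * math.exp(-1.5) * eps**3)


def max_error(eps, lo=-10.0, hi=10.0, n=20001):
    """Largest |g(f(x)) - x| over an evenly spaced grid on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    worst = 0.0
    for i in range(n):
        x = lo + i * step
        worst = max(worst, abs(g(f(x, eps), eps) - x))
    return worst


# Sample epsilon values are illustrative only.
for eps in (0.01, 0.05, 0.1):
    print(f"eps={eps}: max error={max_error(eps):.3e}, bound={bound(eps):.3e}")
```

A grid search can only support the bound on the sampled points; it does not replace the analytic argument.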

Appendix D Further details of CRReLU

D.1 Correction Regularized ReLU (CRReLU) Pseudocode

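import torch
import torch.nn as nn
import torch.nn.functional as F


class CRReLU(nn.Module):
    """ReLU plus a learnable correction term eps * x * exp(-x^2 / 2)."""

    def __init__(self, lr=0.01):
        super(CRReLU, self).__init__()
        # `lr` plays the role of epsilon in the correction term;
        # it is a learnable scalar, not an optimizer learning rate.
        self.lr = nn.Parameter(torch.tensor(lr))

    def forward(self, x):
        return F.relu(x) + self.lr * x * torch.exp(-x**2 / 2)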

Algorithm 1 Correction Regularized ReLU (CRReLU) Pseudocode

D.2 Further Discussion on Properties of CRReLU

Figure 3 shows the CRReLU curves for different values of $\epsilon$. As depicted in the figure, the correction term gives CRReLU several desirable properties. It allows gradients to propagate when the input is less than zero, which alleviates the dying ReLU phenomenon to a certain degree; at the same time, as $x$ approaches negative infinity, CRReLU converges to $0$, thereby preserving the sparsity of the model in the negative region.
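These two properties can be checked with a scalar version of the function. The sketch below is a plain-Python illustration, not the authors' implementation; the helper names `crrelu` and `crrelu_grad` and the sample inputs are choices made here, and the default $\epsilon=0.01$ mirrors the default in the pseudocode of D.1. The derivative used is $\mathbb{1}[x>0]+\epsilon\left(1-x^{2}\right)e^{-\frac{x^{2}}{2}}$ for $x\neq 0$.

```python
import math


def crrelu(x, eps=0.01):
    """Scalar CRReLU: relu(x) + eps * x * exp(-x^2 / 2)."""
    return max(x, 0.0) + eps * x * math.exp(-x * x / 2)


def crrelu_grad(x, eps=0.01):
    """Derivative for x != 0: [x > 0] + eps * (1 - x^2) * exp(-x^2 / 2)."""
    relu_part = 1.0 if x > 0 else 0.0
    return relu_part + eps * (1 - x * x) * math.exp(-x * x / 2)


# Sample inputs are illustrative only.
for x in (-8.0, -2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  crrelu={crrelu(x):+.6e}  grad={crrelu_grad(x):+.6e}")
```

For negative inputs the gradient equals $\epsilon\left(1-x^{2}\right)e^{-\frac{x^{2}}{2}}$, which is nonzero except at $x=-1$ and decays rapidly as $x\to-\infty$, in line with the behaviour described above.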

Figure 3: CRReLU with different $\epsilon$ values

Appendix E Details of experimental settings

E.1 Task of Image Classification

Table 5: Experimental settings of ViT, DeiT and TNT on CIFAR-10 and CIFAR-100 datasets

| Setting | Value |
| --- | --- |
| Image Size | 32 $\times$ 32 |
| Patch Size | 4 |
| Embedding Dim | 192 for ViT-Tiny and DeiT-Tiny; 384 for TNT-Small |
| Optimizer | AdamW with weight decay = 0.05 |
| Learning Rate | Cosine Annealing Learning Rate Scheduler; initial lr = $2.5\times 10^{-4}$; lr drop = -1; min lr = $1\times 10^{-5}$ |
| Warm up | warmup epochs = 20; warmup learning rate = $1\times 10^{-6}$ |
| Gradient Clipping | 1.0 |
| Training Epochs | 100 |
| Batch Size | 256 |
| Loss Function | CrossEntropy Loss |
| Normalization | Layer Norm |
| Data Augmentation | True (provided by timm) |
| Drop Out and Drop Path | False |
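The learning-rate settings in Table 5 can be illustrated with a small schedule function. This is a sketch under stated assumptions rather than the scheduler actually used: it assumes a linear ramp from the warmup learning rate to the initial learning rate over the 20 warmup epochs, followed by cosine decay to the minimum learning rate over the remaining 80 epochs. The "lr drop" setting is not modeled, and the function name `lr_at` is hypothetical.

```python
import math

BASE_LR = 2.5e-4      # initial lr (Table 5)
MIN_LR = 1e-5         # min lr (Table 5)
WARMUP_LR = 1e-6      # warmup learning rate (Table 5)
WARMUP_EPOCHS = 20    # warmup epochs (Table 5)
TOTAL_EPOCHS = 100    # training epochs (Table 5)


def lr_at(epoch):
    """Learning rate at a (possibly fractional) epoch in [0, TOTAL_EPOCHS]."""
    if epoch < WARMUP_EPOCHS:
        # Assumed: linear ramp from WARMUP_LR up to BASE_LR.
        return WARMUP_LR + (BASE_LR - WARMUP_LR) * epoch / WARMUP_EPOCHS
    # Assumed: cosine decay from BASE_LR to MIN_LR over the post-warmup epochs.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for epoch in (0, 10, 20, 60, 100):
    print(f"epoch {epoch:3d}: lr = {lr_at(epoch):.3e}")
```

Library schedulers may differ in detail, for example in whether the warmup epochs count toward the cosine cycle, so the exact per-epoch values should be taken from the training code rather than from this sketch.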

Table 6: Experimental settings of ViT and DeiT on ImageNet-1K dataset

| Setting | Value |
| --- | --- |
| Image Size | 224 $\times$ 224 |
| Patch Size | 16 |
| Embedding Dim | 192 |
| Optimizer | AdamW with weight decay = 0.05 |
| Learning Rate | Cosine Annealing Learning Rate Scheduler; initial lr = $2.5\times 10^{-4}$; lr drop = -1; min lr = $1\times 10^{-5}$ |
| Warm up | warmup epochs = 20; warmup learning rate = $1\times 10^{-6}$ |
| Gradient Clipping | 1.0 |
| Training Epochs | 100 |
| Batch Size | 256 |
| Loss Function | CrossEntropy Loss |
| Normalization | Layer Norm |
| Data Augmentation | True (provided by timm) |
| Drop Out and Drop Path | False |

Table 7: Changes in parameter count when employing various activation functions. GELU, ELU, CELU, SiLU (Swish), and Mish are activation functions without a learnable parameter (AFs without LP), while PReLU and CRReLU are activation functions with a learnable parameter (AFs with LP). The results demonstrate that the increase in parameter count introduced by the learnable parameter is negligible.

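| Model | Parameter Number | CIFAR-10 | CIFAR-100 | ImageNet-1K |
| --- | --- | --- | --- | --- |
| ViT-Tiny | AFs without LP | 5399818 | 5417188 | 5754472 |
| ViT-Tiny | AFs with LP | 5399830 | 5417200 | 5754484 |
| DeiT-Tiny | AFs without LP | 5365076 | 5399816 | 5910800 |
| DeiT-Tiny | AFs with LP | 5365088 | 5399828 | 5910812 |
| TNT-Small | AFs without LP | 21525298 | 21559948 | / |
| TNT-Small | AFs with LP | 21525322 | 21559972 | / |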
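The size of the increase can be read directly from Table 7. The short script below, given only as an arithmetic illustration, subtracts the two rows for each model and dataset and reports the relative overhead; the dictionary layout and variable names are choices made here.

```python
# Parameter counts copied from Table 7: (AFs without LP, AFs with LP).
counts = {
    ("ViT-Tiny", "CIFAR-10"): (5399818, 5399830),
    ("ViT-Tiny", "CIFAR-100"): (5417188, 5417200),
    ("ViT-Tiny", "ImageNet-1K"): (5754472, 5754484),
    ("DeiT-Tiny", "CIFAR-10"): (5365076, 5365088),
    ("DeiT-Tiny", "CIFAR-100"): (5399816, 5399828),
    ("DeiT-Tiny", "ImageNet-1K"): (5910800, 5910812),
    ("TNT-Small", "CIFAR-10"): (21525298, 21525322),
    ("TNT-Small", "CIFAR-100"): (21559948, 21559972),
}

# Absolute and relative increase for each (model, dataset) pair.
extra = {key: with_lp - without_lp for key, (without_lp, with_lp) in counts.items()}
relative = {key: extra[key] / counts[key][0] for key in counts}

for (model, dataset), added in extra.items():
    rel = relative[(model, dataset)]
    print(f"{model:9s} {dataset:11s} +{added} parameters ({rel:.2e} relative)")
```

The difference is 12 parameters for ViT-Tiny and DeiT-Tiny and 24 for TNT-Small, a relative overhead on the order of $10^{-6}$.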

E.2 Task of Large Language Model (LLM) Fine-tuning

Table 8: Experimental settings of GPT2 fine-tuning task

| Setting | Value |
| --- | --- |
| Batch Size | 32 |
| Optimizer | RMSprop (More Memory-Efficient) |
| Learning Rate | $5\times 10^{-7}$ with linear warmup steps of 150 |
| Trainer | FSDPTrainer (2 GPUs) |
| Max Gradient Norm | 10.0 |
| Max Length for an Input (Prompt + Response) | 512 |
| Max Length for Prompt | 256 |