Self-Normalizing Neural Networks

SELU, SNN successes, and references (2017–2025)


The Great Comeback of Self-Normalizing Networks in 2025

Günter Klambauer

It has been a wild year in AI, and especially for self-normalizing networks and SELU activations!

Normalization-free Transformers. Will we get LLMs without normalization layers?

In March 2025, I saw “Transformers without Normalization” by Yann LeCun and colleagues drop on arXiv. I thought “now they finally have it”, because Yann had been thinking in directions similar to mine already back in his “Efficient Backprop” tutorial. After all, self-normalizing networks require the initialization named after him (“LeCun initialization”). Strangely, it’s just a scaled tanh activation that does the job… ok!
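For readers wondering what “just a scaled tanh” looks like in practice, here is a minimal sketch of such a layer used as a drop-in replacement for LayerNorm. The form (a learnable scalar inside the tanh plus a per-channel scale and shift) reflects my reading of the paper; the class name `DynamicTanh` and the default for `alpha` are illustrative, not the authors’ reference implementation.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Sketch of a scaled-tanh layer replacing LayerNorm:
    y = weight * tanh(alpha * x) + bias, with a learnable scalar alpha
    and per-channel weight/bias. Details are assumptions, not the paper's code."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))           # per-channel scale
        self.bias = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: swap it in wherever a Transformer block would apply LayerNorm.
x = torch.randn(8, 16, 512)           # (batch, tokens, channels)
print(DynamicTanh(512)(x).shape)      # torch.Size([8, 16, 512])
```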

The 3x64 baseline phenomenon in conditional flow matching and Schrödinger bridges

Then we got all this nice work on conditional flow matching: 2- or 3-layer SELU networks with a width of 64 have practically become the standard here since Alex Tong’s work and the torchcfm implementation. In this setting the SELU network represents the derivative of another function (the vector field, i.e. the time derivative of the flow), and this is where the smoothness of SELU networks, i.e. a smooth derivative of that other function, is clearly an improvement over ReLU networks.
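For concreteness, here is a minimal sketch of that “3x64” baseline: a small, time-conditioned SELU MLP used as the vector field v_theta(x, t). The class name, the way time is concatenated, and the exact layering are illustrative assumptions, not torchcfm’s API.

```python
import torch
import torch.nn as nn

class VectorFieldMLP(nn.Module):
    """Common flow-matching baseline: 3 hidden SELU layers of width 64
    parameterizing the vector field dx/dt = v_theta(x, t)."""
    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, width), nn.SELU(),   # +1 for the time input t
            nn.Linear(width, width), nn.SELU(),
            nn.Linear(width, width), nn.SELU(),
            nn.Linear(width, dim),                  # outputs dx/dt
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t[:, None]], dim=-1))

# Usage: regress v_theta(x_t, t) onto the conditional target velocity u_t.
v = VectorFieldMLP(dim=2)
x, t = torch.randn(128, 2), torch.rand(128)
print(v(x, t).shape)   # torch.Size([128, 2])
```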

Time-series foundation models rely on SELU

2025 was clearly the year of time-series foundation models, and I am very happy that we had a part in this. Clearly, our TiRex taking the lead on the GIFT Eval leaderboard (ahead of Amazon’s Chronos) was one of my favourite moments in 2025. However, the other foundation models, like FIM/FIM-ℓ and Flowstate, all use SELU activations as well.

RL systems use SELU for stability

One of the quiet but undeniable trends of 2025 is the return of SELU in reinforcement learning. Across several independent lines of work, researchers rediscovered what we already saw in 2017 during the Learning-to-Run challenge: actor–critic methods become meaningfully more stable when the policy/value heads use SELU instead of ReLU. This year, the evidence became impossible to ignore. In PPO-based code optimization, the Pearl system (2025) uses SELU inside its actor–critic MLPs and reports substantially smoother training dynamics during policy updates. HPC scheduling frameworks such as InEPS apply SELU in their PPO actor–critic networks to tame exploding or vanishing activations caused by heterogeneous inputs and reward signals. Multi-objective RL, e.g. latent-conditioned policy-gradient methods, increasingly defaults to SELU for all hidden layers, because it simply behaves more predictably under policy-gradient noise. I think the pattern is the following: whenever RL systems avoid batch normalization (which they usually want to or must), SELU becomes one of the most stable activations for deep value functions and stochastic policies.
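As a concrete illustration of the pattern (not the architecture of any of the systems named above), here is a minimal normalization-free actor–critic sketch in PyTorch in which both heads use SELU hidden layers; all dimensions and widths are made up.

```python
import torch
import torch.nn as nn

def selu_mlp(in_dim: int, out_dim: int, width: int = 64) -> nn.Sequential:
    """Normalization-free MLP with SELU hidden layers (no BatchNorm/LayerNorm)."""
    return nn.Sequential(
        nn.Linear(in_dim, width), nn.SELU(),
        nn.Linear(width, width), nn.SELU(),
        nn.Linear(width, out_dim),
    )

class ActorCritic(nn.Module):
    """Illustrative SELU actor-critic for a discrete-action PPO-style setup."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.actor = selu_mlp(obs_dim, n_actions)   # policy logits
        self.critic = selu_mlp(obs_dim, 1)          # state value

    def forward(self, obs: torch.Tensor):
        pi = torch.distributions.Categorical(logits=self.actor(obs))
        return pi, self.critic(obs).squeeze(-1)

# Usage with a batch of observations:
pi, v = ActorCritic(obs_dim=8, n_actions=4)(torch.randn(32, 8))
print(pi.sample().shape, v.shape)   # torch.Size([32]) torch.Size([32])
```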

Graph convolutional networks consistently replace ReLU with SELU for better convergence and robustness

A growing line of graph clustering (DMoN, DGCLUSTER, MetaGC, Potts-GNN) and privacy-preserving GNN work (LPGNN, GAP, UPGNET) consistently replaces ReLU with SELU and reports better convergence or robustness. Since DMoN, GCNs in this line of work use the following forward propagation:

\[H^{(l+1)} = \mathrm{SELU} \left( \tilde{A} H^{(l)} W^{(l)} + H^{(l)} W_{\text{skip}}^{(l)} \right)\]

where \(\tilde{A}\) is the normalized adjacency matrix. The classic GCN used \(H^{(l+1)} = \sigma \left( \tilde{A} H^{(l)} W^{(l)} \right)\) with a sigmoid or ReLU activation. While the full SNN theory doesn’t directly apply to message passing, a shallow GNN layer is still “linear aggregation + nonlinearity”, and SELU’s self-normalizing behavior seems to provide more stable training in normalization-free, noisy, or shallow GNN settings.
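A dense-adjacency sketch of the layer above, purely for illustration; the class name and the symmetric normalization mentioned in the comment are assumptions rather than any specific paper’s code.

```python
import torch
import torch.nn as nn

class SELUGCNLayer(nn.Module):
    """One DMoN-style layer: H^{l+1} = SELU(A_tilde @ H @ W + H @ W_skip).

    Uses a dense normalized adjacency for clarity; real implementations
    typically use sparse ops."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)       # W^{(l)}
        self.w_skip = nn.Linear(in_dim, out_dim, bias=False)  # W_skip^{(l)}
        self.act = nn.SELU()

    def forward(self, h: torch.Tensor, a_tilde: torch.Tensor) -> torch.Tensor:
        # a_tilde: normalized adjacency, e.g. D^{-1/2} (A + I) D^{-1/2}
        return self.act(a_tilde @ self.w(h) + self.w_skip(h))

# Usage on a toy 5-node graph with 16-dimensional node features:
h, a_tilde = torch.randn(5, 16), torch.eye(5)
print(SELUGCNLayer(16, 32)(h, a_tilde).shape)   # torch.Size([5, 32])
```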

AI is hitting a wall in drug discovery, a wall built of SELUs

We’ve re-evaluated machine-learning and deep-learning methods from the last 25 years on the Tox21 Data Challenge dataset. Ok, LLMs can do this at least a bit, but they are far off any reasonable performance. Recent methods like GNNs are a bit behind the state of the art, but we were actually extremely surprised that the SELU networks from 2017 still perform best on this Tox21 leaderboard. People have been wondering why AI hasn’t found a new drug yet, nor improved drug discovery a lot… yeah, this might be a hint. Deep learning methods are good at DESIGNING molecules, and are brilliant at MAKING them (in the sense of predicting chemical synthesis routes), but AI systems are obviously BAD AT TESTING those molecules. By TESTING, I mean virtually testing them by predicting their biological properties, such as toxic effects. So, surprisingly, we have to improve the TEST step in the DESIGN-MAKE-TEST-ANALYSE cycle.

Papers, models and architectures built on Self-Normalizing Networks (SELU / SNN)

Foundational papers

Generative Models

Diffusion & Score-based models

Flow Matching / Schrödinger Bridges

Multi-marginal / irregular-time dynamics

Normalizing Flows

Autoencoders & VAEs

GANs

MLPs / Tabular / Scientific MLPs

Transformers & LLM-adjacent models

Graph Neural Networks (GNNs)

Time series & Foundation(-style) models

Convolutional Neural Networks

Recurrent / Sequence models (non-Transformer)

Reinforcement Learning: improved stability of RL systems

Tutorials and implementations for “Self-normalizing networks” (SNNs) as suggested by Klambauer et al. (arXiv pre-print).

Versions

Note for Tensorflow >= 1.4 users

Tensorflow >= 1.4 already provides the functions tf.nn.selu and tf.contrib.nn.alpha_dropout, which implement the SELU activation function and the suggested dropout variant (alpha dropout).
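A minimal Tensorflow 1.x sketch of one SNN layer using those functions (assumes TF >= 1.4 with tf.contrib available; layer sizes are arbitrary):

```python
import tensorflow as tf  # TensorFlow 1.x

x = tf.placeholder(tf.float32, shape=[None, 128])
keep_prob = tf.placeholder(tf.float32)   # keep probability for alpha dropout

# Dense layer with LeCun-normal initialization, SELU activation, and alpha dropout.
h = tf.layers.dense(x, 256, kernel_initializer=tf.keras.initializers.lecun_normal())
h = tf.nn.selu(h)
h = tf.contrib.nn.alpha_dropout(h, keep_prob)
```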

Note for Tensorflow >= 2.0 users

Tensorflow 2.3 already provides the SELU activation function in the high-level Keras API as tf.keras.activations.selu. It must be combined with tf.keras.initializers.LecunNormal; the corresponding dropout variant is tf.keras.layers.AlphaDropout.
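A minimal Keras sketch of a self-normalizing block following those recommendations (layer sizes and dropout rate are arbitrary):

```python
import tensorflow as tf

# Each Dense layer pairs SELU with LeCun-normal initialization; AlphaDropout
# (rather than standard Dropout) preserves the self-normalizing property.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="selu",
                          kernel_initializer=tf.keras.initializers.LecunNormal(),
                          input_shape=(128,)),
    tf.keras.layers.AlphaDropout(0.05),
    tf.keras.layers.Dense(256, activation="selu",
                          kernel_initializer=tf.keras.initializers.LecunNormal()),
    tf.keras.layers.AlphaDropout(0.05),
    tf.keras.layers.Dense(1),
])
model.summary()
```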

Note for Pytorch users

Pytorch versions >= 0.2 feature torch.nn.SELU and torch.nn.AlphaDropout. They must be combined with the correct initializer, namely torch.nn.init.kaiming_normal_(parameter, mode='fan_in', nonlinearity='linear'), as this is identical to LeCun initialization (mode='fan_in') with a gain of 1 (nonlinearity='linear').
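A minimal Pytorch sketch following that recipe (network width, depth, and dropout rate are arbitrary):

```python
import torch
import torch.nn as nn

# Small self-normalizing MLP: SELU activations plus AlphaDropout, with
# LeCun-normal weight init via kaiming_normal_(..., nonlinearity='linear').
model = nn.Sequential(
    nn.Linear(128, 256), nn.SELU(), nn.AlphaDropout(p=0.05),
    nn.Linear(256, 256), nn.SELU(), nn.AlphaDropout(p=0.05),
    nn.Linear(256, 1),
)

for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="linear")
        nn.init.zeros_(module.bias)

print(model(torch.randn(32, 128)).shape)   # torch.Size([32, 1])
```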

Tutorials

Tensorflow 1.x

Tensorflow 2.x (Keras)

Pytorch

Further material

Design novel SELU functions (Tensorflow 1.x)

Basic python functions to implement SNNs (Tensorflow 1.x)

are provided as code chunks here: selu.py

Notebooks and code to produce Figure 1 (Tensorflow 1.x)

are provided here: Figure1, which builds on top of the biutils package.

Calculations and numeric checks of the theorems (Mathematica)

are provided as Mathematica notebooks here:

UCI, Tox21 and HTRU2 data sets