Self-Normalizing Neural Networks

SELU, SNN successes, and references (2017–2025)


The Great Comeback of Self-Normalizing Networks in 2025

Günter Klambauer

It has been a wild year in AI, and especially for self-normalizing networks and SELU activations!

Normalization-free Transformers. Will we get LLMs without normalization layers?

In March 2025, I saw “Transformers without Normalization” by Yann LeCun and colleagues drop on arXiv. I thought “now they finally have it”, because Yann had been thinking in directions similar to mine already back in his “Efficient Backprop” tutorial. After all, self-normalizing networks require the initialization named after him (“LeCun initialization”). Strangely, it’s just a scaled tanh activation that does the job… ok!
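For readers wondering what “just a scaled tanh” looks like in practice, here is a minimal sketch of such a layer used as a drop-in replacement for LayerNorm. The form (a learnable scalar inside the tanh plus a per-channel scale and shift) reflects my reading of the paper; the class name `DynamicTanh` and the default for `alpha` are illustrative, not the authors’ reference implementation.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Sketch of a scaled-tanh layer replacing LayerNorm:
    y = weight * tanh(alpha * x) + bias, with a learnable scalar alpha
    and per-channel weight/bias. Details are assumptions, not the paper's code."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # learnable scalar
        self.weight = nn.Parameter(torch.ones(dim))           # per-channel scale
        self.bias = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * torch.tanh(self.alpha * x) + self.bias

# Usage: swap it in wherever a Transformer block would apply LayerNorm.
x = torch.randn(8, 16, 512)           # (batch, tokens, channels)
print(DynamicTanh(512)(x).shape)      # torch.Size([8, 16, 512])
```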

The 3x64 baseline phenomenon in conditional flow matching and Schrödinger bridges

Then we got all this nice work on conditional flow matching: 2- or 3-layer SELU networks with a width of 64 have practically become the standard here since Alex Tong’s work and the torchcfm implementation. In this setting the SELU network represents the derivative of another function (the vector field, i.e. the time derivative of the flow), and this is where the smoothness of SELU networks, i.e. a smooth derivative of that other function, is clearly an improvement over ReLU networks.
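For concreteness, here is a minimal sketch of that “3x64” baseline: a small, time-conditioned SELU MLP used as the vector field v_theta(x, t). The class name, the way time is concatenated, and the exact layering are illustrative assumptions, not torchcfm’s API.

```python
import torch
import torch.nn as nn

class VectorFieldMLP(nn.Module):
    """Common flow-matching baseline: 3 hidden SELU layers of width 64
    parameterizing the vector field dx/dt = v_theta(x, t)."""
    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, width), nn.SELU(),   # +1 for the time input t
            nn.Linear(width, width), nn.SELU(),
            nn.Linear(width, width), nn.SELU(),
            nn.Linear(width, dim),                  # outputs dx/dt
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t[:, None]], dim=-1))

# Usage: regress v_theta(x_t, t) onto the conditional target velocity u_t.
v = VectorFieldMLP(dim=2)
x, t = torch.randn(128, 2), torch.rand(128)
print(v(x, t).shape)   # torch.Size([128, 2])
```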

Time-series foundation models rely on SELU

2025 was clearly the year of time-series foundation models, and I am very happy that we had a part in this. Clearly, our TiRex taking the lead on the GIFT Eval leaderboard (ahead of Amazon’s Chronos) was one of my favourite moments in 2025. However, the other foundation models, like FIM/FIM-ℓ and Flowstate, all use SELU activations as well.

RL systems use SELU for stability

One of the quiet but undeniable trends of 2025 is the return of SELU in reinforcement learning. Across several independent lines of work, researchers rediscovered what we already saw in 2017 during the Learning-to-Run challenge: actor–critic methods become meaningfully more stable when the policy/value heads use SELU instead of ReLU. This year, the evidence became impossible to ignore. In PPO-based code optimization, the Pearl system (2025) uses SELU inside its actor–critic MLPs and reports substantially smoother training dynamics during policy updates. HPC scheduling frameworks such as InEPS apply SELU in their PPO actor–critic networks to tame exploding or vanishing activations caused by heterogeneous inputs and reward signals. Multi-objective RL, e.g. latent-conditioned policy-gradient methods, increasingly defaults to SELU for all hidden layers, because it simply behaves more predictably under policy-gradient noise. I think the pattern is the following: whenever RL systems avoid batch normalization (which they usually want to or must), SELU becomes one of the most stable activations for deep value functions and stochastic policies.
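As a concrete illustration of the pattern (not the architecture of any of the systems named above), here is a minimal normalization-free actor–critic sketch in PyTorch in which both heads use SELU hidden layers; all dimensions and widths are made up.

```python
import torch
import torch.nn as nn

def selu_mlp(in_dim: int, out_dim: int, width: int = 64) -> nn.Sequential:
    """Normalization-free MLP with SELU hidden layers (no BatchNorm/LayerNorm)."""
    return nn.Sequential(
        nn.Linear(in_dim, width), nn.SELU(),
        nn.Linear(width, width), nn.SELU(),
        nn.Linear(width, out_dim),
    )

class ActorCritic(nn.Module):
    """Illustrative SELU actor-critic for a discrete-action PPO-style setup."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.actor = selu_mlp(obs_dim, n_actions)   # policy logits
        self.critic = selu_mlp(obs_dim, 1)          # state value

    def forward(self, obs: torch.Tensor):
        pi = torch.distributions.Categorical(logits=self.actor(obs))
        return pi, self.critic(obs).squeeze(-1)

# Usage with a batch of observations:
pi, v = ActorCritic(obs_dim=8, n_actions=4)(torch.randn(32, 8))
print(pi.sample().shape, v.shape)   # torch.Size([32]) torch.Size([32])
```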

Graph convolutional networks consistently replace ReLU with SELU for better convergence and robustness

A growing line of graph clustering (DMoN, DGCLUSTER, MetaGC, Potts-GNN) and privacy-preserving GNN work (LPGNN, GAP, UPGNET) consistently replaces ReLU with SELU and reports better convergence or robustness. Since DMoN, GCNs in this line of work use the following forward propagation:

\[H^{(l+1)} = \mathrm{SELU} \left( \tilde{A} H^{(l)} W^{(l)} + H^{(l)} W_{\text{skip}}^{(l)} \right)\]

where \(\tilde{A}\) is the normalized adjacency matrix. The classic GCN used \(H^{(l+1)} = \sigma \left( \tilde{A} H^{(l)} W^{(l)} \right)\) with a sigmoid or ReLU activation. While the full SNN theory doesn’t directly apply to message passing, a shallow GNN layer is still “linear aggregation + nonlinearity”, and SELU’s self-normalizing behavior seems to provide more stable training in normalization-free, noisy, or shallow GNN settings.
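A dense-adjacency sketch of the layer above, purely for illustration; the class name and the symmetric normalization mentioned in the comment are assumptions rather than any specific paper’s code.

```python
import torch
import torch.nn as nn

class SELUGCNLayer(nn.Module):
    """One DMoN-style layer: H^{l+1} = SELU(A_tilde @ H @ W + H @ W_skip).

    Uses a dense normalized adjacency for clarity; real implementations
    typically use sparse ops."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)       # W^{(l)}
        self.w_skip = nn.Linear(in_dim, out_dim, bias=False)  # W_skip^{(l)}
        self.act = nn.SELU()

    def forward(self, h: torch.Tensor, a_tilde: torch.Tensor) -> torch.Tensor:
        # a_tilde: normalized adjacency, e.g. D^{-1/2} (A + I) D^{-1/2}
        return self.act(a_tilde @ self.w(h) + self.w_skip(h))

# Usage on a toy 5-node graph with 16-dimensional node features:
h, a_tilde = torch.randn(5, 16), torch.eye(5)
print(SELUGCNLayer(16, 32)(h, a_tilde).shape)   # torch.Size([5, 32])
```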

AI is hitting a wall in drug discovery, a wall built of SELUs

We’ve re-evaluated machine-learning and deep-learning methods from the last 25 years on the Tox21 Data Challenge dataset. Ok, LLMs can do this at least a bit, but they are far off any reasonable performance. Recent methods like GNNs are a bit behind the state of the art, but we were actually extremely surprised that the SELU networks from 2017 still perform best on this Tox21 leaderboard. People have been wondering why AI hasn’t found a new drug yet, nor improved drug discovery a lot… yeah, this might be a hint. Deep learning methods are good at DESIGNING molecules, and are brilliant at MAKING them (in the sense of predicting chemical synthesis routes), but AI systems are obviously BAD AT TESTING those molecules. By TESTING, I mean virtually testing them by predicting their biological properties, such as toxic effects. So, surprisingly, we have to improve the TEST step in the DESIGN-MAKE-TEST-ANALYSE cycle.

Papers, models and architectures built on Self-Normalizing Networks (SELU / SNN)

Foundational papers

Generative Models

Diffusion & Score-based models

Flow Matching / Schrödinger Bridges

Multi-marginal / irregular-time dynamics

Normalizing Flows

Autoencoders & VAEs

GANs

MLPs / Tabular / Scientific MLPs

Transformers & LLM-adjacent models

Graph Neural Networks (GNNs)

Time series & Foundation(-style) models

Convolutional Neural Networks

Recurrent / Sequence models (non-Transformer)

Reinforcement Learning: improved stability of RL systems

Tutorials and implementations for “Self-normalizing networks” (SNNs) as suggested by Klambauer et al. (arXiv pre-print).

Versions

Note for Tensorflow >= 1.4 users

Tensorflow >= 1.4 already provides the functions tf.nn.selu and tf.contrib.nn.alpha_dropout, which implement the SELU activation function and the suggested dropout variant (alpha dropout).
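A minimal Tensorflow 1.x sketch of one SNN layer using those functions (assumes TF >= 1.4 with tf.contrib available; layer sizes are arbitrary):

```python
import tensorflow as tf  # TensorFlow 1.x

x = tf.placeholder(tf.float32, shape=[None, 128])
keep_prob = tf.placeholder(tf.float32)   # keep probability for alpha dropout

# Dense layer with LeCun-normal initialization, SELU activation, and alpha dropout.
h = tf.layers.dense(x, 256, kernel_initializer=tf.keras.initializers.lecun_normal())
h = tf.nn.selu(h)
h = tf.contrib.nn.alpha_dropout(h, keep_prob)
```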

Note for Tensorflow >= 2.0 users

Tensorflow 2.3 already provides the SELU activation function in the high-level Keras API as tf.keras.activations.selu. It must be combined with tf.keras.initializers.LecunNormal; the corresponding dropout variant is tf.keras.layers.AlphaDropout.
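A minimal Keras sketch of a self-normalizing block following those recommendations (layer sizes and dropout rate are arbitrary):

```python
import tensorflow as tf

# Each Dense layer pairs SELU with LeCun-normal initialization; AlphaDropout
# (rather than standard Dropout) preserves the self-normalizing property.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="selu",
                          kernel_initializer=tf.keras.initializers.LecunNormal(),
                          input_shape=(128,)),
    tf.keras.layers.AlphaDropout(0.05),
    tf.keras.layers.Dense(256, activation="selu",
                          kernel_initializer=tf.keras.initializers.LecunNormal()),
    tf.keras.layers.AlphaDropout(0.05),
    tf.keras.layers.Dense(1),
])
model.summary()
```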

Note for Pytorch users

Pytorch versions >= 0.2 feature torch.nn.SELU and torch.nn.AlphaDropout. They must be combined with the correct initializer, namely torch.nn.init.kaiming_normal_(parameter, mode='fan_in', nonlinearity='linear'), as this is identical to LeCun initialization (mode='fan_in') with a gain of 1 (nonlinearity='linear').
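A minimal Pytorch sketch following that recipe (network width, depth, and dropout rate are arbitrary):

```python
import torch
import torch.nn as nn

# Small self-normalizing MLP: SELU activations plus AlphaDropout, with
# LeCun-normal weight init via kaiming_normal_(..., nonlinearity='linear').
model = nn.Sequential(
    nn.Linear(128, 256), nn.SELU(), nn.AlphaDropout(p=0.05),
    nn.Linear(256, 256), nn.SELU(), nn.AlphaDropout(p=0.05),
    nn.Linear(256, 1),
)

for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="linear")
        nn.init.zeros_(module.bias)

print(model(torch.randn(32, 128)).shape)   # torch.Size([32, 1])
```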

Tutorials

Tensorflow 1.x

Tensorflow 2.x (Keras)

Pytorch

Further material

Design novel SELU functions (Tensorflow 1.x)

Basic python functions to implement SNNs (Tensorflow 1.x)

are provided as code chunks here: selu.py

Notebooks and code to produce Figure 1 (Tensorflow 1.x)

are provided here: Figure1, which builds on top of the biutils package.

Calculations and numeric checks of the theorems (Mathematica)

are provided as Mathematica notebooks here:

UCI, Tox21 and HTRU2 data sets