SELU, SNN successes, and references (2017–2025)
It has been a wild year in AI and especially for self-normalizing networks and SELU activations!
In March 2025, I saw “Transformers without Normalization” by Yann LeCun and colleagues drop on arXiv. I thought “now they finally have it”, because Yann has been thinking in similar directions since his “Efficient Backprop” tutorial; after all, self-normalizing networks require the initialization named after him (“LeCun initialization”). Strangely, it’s just a scaled tanh activation that does the job… ok!
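For context, my reading of the paper is that the proposed Dynamic Tanh (DyT) layer replaces each normalization layer with an elementwise scaled tanh with learnable parameters. A minimal sketch (class name and initial value are my assumptions, not the authors’ reference code):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Sketch of Dynamic Tanh: y = gamma * tanh(alpha * x) + beta,
    with a learnable scalar alpha and learnable per-channel gamma/beta,
    used in place of a normalization layer."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash activations with a scaled tanh instead of normalizing them.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```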
Then we got all this nice work on conditional flow matching: 2- or 3-layer SELU networks with a width of 64 have quasi become the standard since Alex Tong’s work and his implementation in torchcfm. Here the SELU network represents the derivative of another function, and this is where the smoothness of SELU networks, i.e. the smooth derivative of that other function, is a clear improvement over ReLU networks.
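To make this concrete, here is a minimal sketch of such a vector-field network and the conditional flow-matching target it is typically regressed onto; the dimensions, names, and the linear interpolation path are placeholder assumptions of mine, not the torchcfm API:

```python
import torch
import torch.nn as nn

class VectorFieldMLP(nn.Module):
    """Small SELU MLP v_theta(t, x): a sketch of the 2-3 layer,
    width-64 vector-field networks used in conditional flow matching."""
    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, width), nn.SELU(),   # +1 input for the time t
            nn.Linear(width, width), nn.SELU(),
            nn.Linear(width, dim),                  # predicted velocity dx/dt
        )

    def forward(self, t: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # t: (batch, 1), x: (batch, dim); concatenate time and state.
        return self.net(torch.cat([t, x], dim=-1))

# Flow-matching regression target (sketch, linear path):
# for x_t = (1 - t) * x0 + t * x1 the conditional target velocity is
# u_t = x1 - x0, and the loss is || v_theta(t, x_t) - u_t ||^2
# averaged over samples of t, x0, x1.
```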
2025 was clearly the year of time-series foundation models, and I am very happy that we played a part in this. Our TiRex taking the lead on the GIFT-Eval leaderboard (ahead of Amazon’s Chronos) was one of my favourite moments of 2025. Notably, the other foundation models, like FIM/FIM-ℓ and Flowstate, all use SELU activations.
One of the quiet but undeniable trends of 2025 is the return of SELU in reinforcement learning. Across several independent lines of work, researchers rediscovered what we already saw in 2017 during the Learning-to-Run challenge: actor–critic methods become meaningfully more stable when the policy/value heads use SELU instead of ReLU. This year, the evidence became impossible to ignore:

- In PPO-based code optimization, the Pearl system (2025) uses SELU inside its actor–critic MLPs and reports substantially smoother training dynamics during policy updates.
- HPC scheduling frameworks such as InEPS apply SELU in their PPO actor–critic networks to tame exploding/vanishing activations caused by heterogeneous inputs and reward signals.
- Multi-objective RL, e.g. latent-conditioned policy-gradient methods, increasingly defaults to SELU for all hidden layers, because it simply behaves more predictably under policy-gradient noise.

I think the pattern is the following: whenever RL systems avoid batch normalization (which they usually want to or must), SELU becomes one of the most stable activations for deep value functions and stochastic policies.
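As a concrete illustration of what this looks like architecturally (my own minimal sketch, not code from Pearl or InEPS): a normalization-free actor–critic whose hidden layers use SELU with LeCun-normal initialization.

```python
import torch
import torch.nn as nn

def selu_mlp(sizes):
    """Normalization-free MLP with SELU hidden layers; weights drawn with
    LeCun-normal init so the self-normalizing property can kick in."""
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        lin = nn.Linear(fan_in, fan_out)
        nn.init.kaiming_normal_(lin.weight, mode="fan_in", nonlinearity="linear")
        nn.init.zeros_(lin.bias)
        layers += [lin, nn.SELU()]
    return nn.Sequential(*layers[:-1])  # drop the trailing SELU on the output layer

class ActorCritic(nn.Module):
    """Sketch of a PPO-style actor-critic with SELU hidden layers and no batch norm."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.policy = selu_mlp([obs_dim, hidden, hidden, act_dim])  # action logits / means
        self.value = selu_mlp([obs_dim, hidden, hidden, 1])         # state-value estimate

    def forward(self, obs: torch.Tensor):
        return self.policy(obs), self.value(obs).squeeze(-1)
```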
A growing line of graph-clustering work (DMoN, DGCLUSTER, MetaGC, Potts-GNN) and privacy-preserving GNN work (LPGNN, GAP, UPGNET) consistently replaces ReLU with SELU and reports better convergence or robustness. Since DMoN, GCNs use the following forward propagation:
\[H^{(l+1)} = \mathrm{SELU} \left( \tilde{A} H^{(l)} W^{(l)} + H^{(l)} W_{\text{skip}}^{(l)} \right)\]
where \(\tilde{A}\) is the normalized adjacency matrix. Classic GCNs used \(H^{(l+1)} = \sigma \left( \tilde{A} H^{(l)} W^{(l)} \right)\) with a sigmoid or ReLU activation. While the full SNN theory doesn’t directly apply to message passing, a shallow GNN layer is still “linear aggregation + nonlinearity”, and SELU’s self-normalizing behavior seems to provide more stable training in normalization-free, noisy, or shallow GNN settings.
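A minimal sketch of this propagation rule (dense adjacency for readability; the DMoN/DGCLUSTER reference implementations differ in details, and the class name is mine):

```python
import torch
import torch.nn as nn

class SELUGCNLayer(nn.Module):
    """One layer of the rule above:
    H^{l+1} = SELU(A_norm @ H^l @ W + H^l @ W_skip), with a dense A_norm for clarity."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)        # W^{(l)}
        self.w_skip = nn.Linear(in_dim, out_dim, bias=False)   # W_skip^{(l)}
        for lin in (self.w, self.w_skip):
            nn.init.kaiming_normal_(lin.weight, mode="fan_in", nonlinearity="linear")
        self.act = nn.SELU()

    def forward(self, a_norm: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # a_norm: (N, N) normalized adjacency, h: (N, in_dim) node features.
        return self.act(a_norm @ self.w(h) + self.w_skip(h))
```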
We’ve re-evaluated machine learning and deep learning methods from the last 25 years on the Tox21 Data Challenge dataset. Ok, LLMs can do this at least a bit, but they are far off any reasonable performance. Recent methods like GNNs are a bit behind the state of the art, but we were actually extremely surprised that the SELU networks from 2017 still perform best on this Tox21 leaderboard. People have been wondering why AI hasn’t found a new drug yet, nor improved drug discovery a lot… yeah, this might be a hint. Deep learning methods are good at DESIGNING molecules and brilliant at MAKING them (in the sense of predicting chemical synthesis routes), but AI systems are obviously BAD AT TESTING those molecules. By TESTING, I mean virtually testing them by predicting their biological properties, such as toxic effects. So, surprisingly, we have to improve the TEST step in the DESIGN-MAKE-TEST-ANALYSE cycle.
TensorFlow >= 1.4 already has the functions tf.nn.selu and tf.contrib.nn.alpha_dropout that implement the SELU activation function and the suggested dropout version.
TensorFlow >= 2.3 ships the SELU activation in the high-level Keras framework as tf.keras.activations.selu. It must be combined with tf.keras.initializers.LecunNormal; the corresponding dropout version is tf.keras.layers.AlphaDropout.
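A minimal Keras sketch of such a self-normalizing block (layer widths and dropout rate are arbitrary choices of mine):

```python
import tensorflow as tf

# Self-normalizing block in Keras: SELU activation, LeCun-normal
# initialization, and AlphaDropout instead of standard dropout.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="selu",
                          kernel_initializer=tf.keras.initializers.LecunNormal()),
    tf.keras.layers.AlphaDropout(0.05),
    tf.keras.layers.Dense(256, activation="selu",
                          kernel_initializer=tf.keras.initializers.LecunNormal()),
    tf.keras.layers.AlphaDropout(0.05),
    tf.keras.layers.Dense(1),
])
```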
PyTorch versions >= 0.2 feature torch.nn.SELU and torch.nn.AlphaDropout. They must be combined with the correct initializer, namely torch.nn.init.kaiming_normal_(parameter, mode='fan_in', nonlinearity='linear'), as this is identical to LeCun initialisation (mode='fan_in') with a gain of 1 (nonlinearity='linear').
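The equivalent sketch in PyTorch (again, widths and dropout rate are arbitrary):

```python
import torch
import torch.nn as nn

# Self-normalizing block in PyTorch: SELU + AlphaDropout, with LeCun-normal
# initialization expressed via kaiming_normal_ (fan_in, gain 1) as noted above.
model = nn.Sequential(
    nn.Linear(128, 256), nn.SELU(), nn.AlphaDropout(p=0.05),
    nn.Linear(256, 256), nn.SELU(), nn.AlphaDropout(p=0.05),
    nn.Linear(256, 1),
)

for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="linear")
        nn.init.zeros_(module.bias)
```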
Basic Python functions to implement SNNs are provided as code chunks here: selu.py
Notebooks to reproduce Figure 1 are provided here: Figure1 (builds on top of the biutils package).
Calculations are provided as Mathematica notebooks here: