Spiking Neural Networks (SNNs) represent a promising algorithmic approach for neuromorphic computing systems, offering native temporal processing and significantly improved energy efficiency compared to conventional deep learning architectures. However, training SNNs for control tasks remains challenging due to the non-differentiable nature of spiking neurons, which complicates gradient-based optimization.
This research investigates the critical role of surrogate gradients in training deep spiking neural networks, with a particular focus on how different slope settings and scheduling strategies affect training performance across supervised learning and reinforcement learning scenarios.
The Challenge of Non-Differentiable Spiking Neurons
Spiking neurons operate by accumulating membrane potential until it exceeds a threshold, at which point they emit a discrete spike and reset their potential. This spiking behavior is inherently non-differentiable, making traditional backpropagation impossible. Surrogate gradients provide a solution by approximating the derivative of the spike function with a continuous, differentiable function.
A critical hyperparameter in this process is the slope k of the surrogate function, which determines the sensitivity of the gradient near the spiking threshold. We adopt a fast sigmoid function and examine slope configurations ranging from shallow (k = 1) to steep (k = 100).
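As a concrete sketch of this idea (the exact normalization in our implementation may differ), the fast sigmoid surrogate derivative can be written as a function of the membrane potential and the slope k:

```python
import numpy as np

def fast_sigmoid_surrogate(v, threshold=1.0, k=25.0):
    """Surrogate derivative of the Heaviside spike function.

    The fast sigmoid s(x) = x / (1 + k|x|) has derivative 1 / (1 + k|x|)^2,
    which replaces the true (zero-almost-everywhere) spike derivative in the
    backward pass. Larger k concentrates the gradient near the threshold.
    """
    x = np.asarray(v, dtype=float) - threshold
    return 1.0 / (1.0 + k * np.abs(x)) ** 2
```

At the threshold the surrogate equals 1 regardless of k; away from it, a steep slope (k = 100) suppresses the gradient far more aggressively than a shallow one (k = 1).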
Figure 1: The slope of the surrogate gradient dictates the range of inputs for which a gradient exists. Steeper slopes closely resemble the Dirac delta function but restrict non-zero gradients to a narrow input range.
Gradient Propagation in Deep Networks
Our analysis reveals that surrogate gradient slope settings have profound effects on gradient propagation through deep networks. As shown in Figure 1, steeper slopes closely resemble the Dirac delta function but restrict non-zero gradients to a narrow input range. In contrast, shallower slopes lead to a greater number of non-zero gradients and thus increase the total gradient magnitude, particularly in deeper layers.

Figure 2: A shallower slope carries the gradient deeper through the network, suffering less from vanishing gradients. The network has 4 hidden layers; layer 0 is the first hidden layer, and layer 4 is the output layer.
While the surrogate gradient's slope determines how many weights receive a non-zero update per backward pass, it also introduces noise into the gradient computation. Since the true gradient for deeper network layers does not exist, we analyze the relationship between steep surrogate gradients (k = 100), which best approximate the true gradient, and shallow surrogate gradients using cosine similarity:
Figure 3: A shallower slope introduces bias and variance, computed using the cosine similarity. The cosine similarity for shallow slopes reduces to 0, meaning weight updates in deeper networks become essentially random.
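The comparison underlying Figure 3 amounts to computing the cosine similarity between the flattened weight gradients obtained with the steep reference slope and with a shallow slope on the same batch. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def gradient_cosine_similarity(g_ref, g_shallow):
    """Cosine similarity between two gradient vectors for the same batch,
    e.g. one computed with k = 100 (reference) and one with a shallow k.

    Values near 1 mean the shallow-slope update points in the same direction
    as the reference; values near 0 mean it is essentially uncorrelated."""
    g_ref = np.ravel(g_ref)
    g_shallow = np.ravel(g_shallow)
    return float(np.dot(g_ref, g_shallow)
                 / (np.linalg.norm(g_ref) * np.linalg.norm(g_shallow)))
```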
Surrogate Gradients Across Learning Regimes
We analyze the effect of surrogate gradient slope choices across both fully supervised learning via behavioral cloning (BC) and fully online training algorithms such as TD3. To eliminate the effect of warm-up periods during training for this analysis, we stack a history of observations, process multiple forward passes per observation, and reset the SNN between subsequent actions.
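This evaluation protocol can be sketched as follows; `TinySNN` is a toy stand-in for the actual network, and the interface (`reset`, `step`, the number of passes) is an assumption for illustration:

```python
import numpy as np

class TinySNN:
    """Minimal stand-in for a spiking network with persistent membrane state."""
    def __init__(self, dim):
        self.v = np.zeros(dim)

    def reset(self):
        self.v[:] = 0.0                       # clear membrane potentials

    def step(self, x):
        self.v = 0.9 * self.v + x             # leaky integration
        return (self.v > 1.0).astype(float)   # spikes as output

def act(snn, obs_history, n_passes=4):
    """One action: reset the SNN state, stack the observation history, then
    run several forward passes so the membranes charge up before readout."""
    snn.reset()                               # no state carried between actions
    x = np.concatenate(obs_history)           # stacked observation window
    out = None
    for _ in range(n_passes):
        out = snn.step(x)
    return out
```

Because the state is reset before every action, the output depends only on the stacked observations and not on how long the network has already been running, which removes the warm-up confound.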
Supervised Learning Results
In supervised learning, final performance is largely unaffected by the surrogate slope setting, though training dynamics vary significantly:
- Shallow slopes add noise, requiring more updates but potentially improving exploration
- Steep slopes yield small gradient magnitudes, slowing progress but maintaining gradient accuracy
- Intermediate slopes provide a balance between gradient magnitude and accuracy
Reinforcement Learning Results
In online reinforcement learning, we observed a pronounced preference for much shallower slopes. The reduced cosine similarity between true and surrogate gradients introduces noise that naturally enhances exploration, similar to parameter noise techniques. However, this comes with trade-offs:
- Higher variability in final performance across runs
- Increased risk of poor intermediate updates corrupting the replay buffer
- Less stable training due to the heightened risk of low-quality experiences
Adaptive Slope Scheduling
Beyond fixed slope settings, we investigate two scheduling methods for the surrogate gradient slope:
1. Interval Scheduling
Gradually changes the slope from 1 to 100 at a fixed interval during training. This approach provides a systematic transition from exploration-friendly shallow slopes to more accurate steep slopes.
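A minimal sketch of such a schedule (the number of steps and their spacing are illustrative assumptions, not the exact values used in our experiments):

```python
def interval_slope_schedule(epoch, total_epochs, k_min=1.0, k_max=100.0,
                            n_steps=10):
    """Slope k for the current epoch, raised at fixed intervals from
    k_min (shallow, exploration-friendly) to k_max (steep, accurate)."""
    frac = min(max(epoch / total_epochs, 0.0), 1.0)  # training progress in [0, 1]
    step = min(int(frac * n_steps), n_steps - 1)     # which interval we are in
    return k_min + (step / (n_steps - 1)) * (k_max - k_min)
```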
2. Adaptive Scheduling
Uses a weighted sum of the low-passed first-order derivative of the reward score history and the low-passed reward score itself:

k_t = c_p · Σ_i w_i · r_{t-i} + c_d · Σ_i w_i · r'_{t-i}

Where k_t is the slope at time t, r_{t-i} is the reward score at time t-i, r'_{t-i} is the first-order derivative of the reward score at time t-i, w_i are the low-pass filter weights, and c_p and c_d weight the proportional and derivative terms. The proportional term is largely responsible for not decreasing the slope when maximum performance is reached, preventing the destruction of progress.
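A sketch of this adaptive rule follows; the choice of an exponential low-pass filter and the values of the coefficients, filter constant, and clipping range are illustrative assumptions:

```python
import numpy as np

def adaptive_slope(reward_history, c_p=0.5, c_d=5.0, alpha=0.9,
                   k_min=1.0, k_max=100.0):
    """Slope k_t computed from the recent reward history.

    The reward is exponentially low-passed, and its first-order derivative
    is taken on the filtered signal. The proportional term (filtered reward)
    keeps the slope from collapsing once maximum performance is reached,
    while the derivative term raises it while performance is improving."""
    r = np.asarray(reward_history, dtype=float)
    lp = np.empty_like(r)
    lp[0] = r[0]
    for i in range(1, len(r)):               # exponential low-pass filter
        lp[i] = alpha * lp[i - 1] + (1.0 - alpha) * r[i]
    dr = np.diff(lp, prepend=lp[0])          # filtered first-order derivative
    k = c_p * lp[-1] + c_d * dr[-1]
    return float(np.clip(k, k_min, k_max))
```

With a flat, low reward history the slope stays at its shallow minimum; as the filtered reward and its trend grow, the slope is driven toward its steep maximum.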
Figure 4: Scheduled slope settings improve training efficiency, greatly reducing the number of epochs to reach a reward of 100.
Figure 5: Final performance of trained agents lies in the same regime as fixed slope experiments, but with improved training efficiency.
Key Findings and Implications
Our research reveals several important insights about surrogate gradients in deep spiking neural networks:
- Training Efficiency: Scheduled slope settings significantly improve training efficiency, reducing the number of epochs needed to reach target performance
- Performance Parity: Final performance of agents trained with scheduled slopes matches those trained with fixed optimal slopes
- Hyperparameter Optimization: Scheduled approaches can eliminate the need for exhaustive hyperparameter sweeps across different slope settings
- Learning Regime Dependence: The optimal slope setting varies significantly between supervised learning and reinforcement learning scenarios
- Exploration vs. Accuracy Trade-off: Shallow slopes enhance exploration but introduce noise, while steep slopes maintain accuracy but may limit exploration
Practical Recommendations
Based on our findings, we recommend the following approaches for training spiking neural networks:
For Supervised Learning
Use steep slopes (k = 50-100) or scheduled slopes that transition from shallow to steep during training. The focus should be on maintaining gradient accuracy while ensuring sufficient gradient flow.
For Reinforcement Learning
Use shallow slopes (k = 1-10) or adaptive scheduling that responds to training progress. The additional noise introduced by shallow slopes can enhance exploration, which is crucial for RL tasks.
For Unknown Scenarios
When the optimal slope setting is unclear, use adaptive scheduling to automatically adjust the slope based on training progress. This approach provides a robust solution that adapts to the specific characteristics of the task and network architecture.
Future Research Directions
Several promising avenues for future work have been identified:
- Task-Specific Scheduling: Developing scheduling strategies tailored to specific task characteristics and network architectures
- Multi-Objective Optimization: Balancing exploration, accuracy, and training stability in a unified framework
- Hardware-Aware Training: Optimizing surrogate gradient settings for specific neuromorphic hardware platforms
- Theoretical Analysis: Developing theoretical frameworks to predict optimal slope settings based on network architecture and task complexity
Impact and Significance
This work both advances the theoretical understanding of surrogate gradients in spiking neural networks and provides practical methodologies for training neuromorphic controllers. The insights gained from this research have important implications for the development of energy-efficient, temporally aware neural networks for real-world applications.
The findings contribute to the broader goal of making spiking neural networks more accessible and effective for complex control tasks, particularly in resource-constrained environments where energy efficiency is critical.