Adaptive Surrogate Gradients for SNNs - Part 2: Sequential Reinforcement Learning

Published: 2024 | Reading time: 15 min | Blog Series

Paper accepted at: NeurIPS 2025 Conference

← Part 1: Surrogate Gradients | Part 2 of 3: Sequential RL & Sim-to-Real Transfer | Part 3: Reality Gap →

Surrogate Gradients Research: Flight Trajectory, Training Performance, and Control Architecture

Leveraging the temporal processing capabilities of spiking neural networks (SNNs) in reinforcement learning requires training on sequences rather than individual transitions. However, this presents a critical challenge: subpar initial policies often lead to early episode termination, preventing the collection of sufficiently long sequences to bridge the warm-up period required by stateful networks.

This research introduces a novel reinforcement learning algorithm tailored for continuous control with spiking neural networks, explicitly leveraging their inherent temporal dynamics without relying on frame stacking. We demonstrate the efficacy of our approach by training a low-level spiking neural controller for the Crazyflie quadrotor, successfully bridging the reality gap without observation history augmentation.

The Warm-Up Period Challenge

A key consideration when training stateful neural networks with reinforcement learning is gathering sufficiently long sequences to enable efficient training, while also allowing the hidden states to stabilize for several timesteps after initialization before calculating the gradient. This stabilization period, which we call the warm-up period, is crucial for proper gradient computation in spiking neural networks.

When controlling a drone, a subpar controller terminates episodes quickly, so interactions are too short to reach baseline performance. The agent cannot act long enough to gather the data it would need to improve, creating a fundamental chicken-and-egg problem in training.

Drone control task overview

Figure 1: The spiking controller receives position, velocity, orientation, and angular velocity inputs and outputs motor commands for the Crazyflie quadrotor.

Jump-Start Reinforcement Learning Framework

To address the warm-up period challenge in sequential SNN training, we adapt the Jump-Start Reinforcement Learning (JSRL) framework. This approach leverages a pre-trained guide policy to create a curriculum of starting conditions for a secondary policy.

We implement a privileged, non-spiking actor trained through TD3, whose primary function is to bridge the critical warm-up period required by the SNN. Both this guiding policy and the non-spiking critic receive privileged information in the form of action and observation histories.
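To make the hand-over concrete, the sketch below shows one JSRL-style data-collection episode: the pre-trained guide acts for the first steps, after which the stateful spiking policy takes over. This is a minimal illustration assuming a gym-style environment and `act`/`reset_state` methods on the policies; names such as `jump_start_rollout` and `hand_over_step` are ours, not the paper's.

```python
def jump_start_rollout(env, guide_policy, spiking_policy, hand_over_step, max_steps=500):
    """Collect one episode: the guide acts first, then the spiking policy takes over."""
    obs = env.reset()
    spiking_policy.reset_state()              # clear membrane potentials before the episode
    transitions = []
    for t in range(max_steps):
        if t < hand_over_step:
            action = guide_policy.act(obs)    # privileged, non-spiking TD3 actor
        else:
            action = spiking_policy.act(obs)  # stateful SNN actor, no frame stacking
        next_obs, reward, done, info = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions
```

Gradually shortening `hand_over_step` yields the curriculum of starting conditions described above; all transitions, regardless of which policy produced them, are stored in a shared replay buffer.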

Sequential Reinforcement Learning Algorithm

Our approach combines the benefits of both offline and online learning through a hybrid training strategy:

Key Components:

  • Guiding Policy: TD3-trained non-spiking actor that provides stable initial behavior
  • Spiking Policy: LIF-based neural network that learns temporal dynamics
  • Hybrid Replay Buffer: Contains transitions from both guiding and spiking policies
  • Behavioral Cloning Term: Gradually decaying BC regularization for stable training

Training Objective Function

The objective to be maximized combines the reinforcement learning term with a behavioral cloning penalty:

L_π = E_{τ∼D} [ Σ_{i=0}^{100} Q_φ₁(s_τ,i, π_θ(s_τ,i | s_τ,0, ..., s_τ,i-1)) · 1_{i≥50} ] - λ · E_{τ∼D} [ Σ_{i=0}^{100} ||π_θ(s_τ,i | s_τ,0, ..., s_τ,i-1) - a_τ,i||² · 1_{i≥50} ]

Where:

  • Q_φ₁ represents the first critic network (as used in TD3)
  • τ denotes a sequence sampled from the replay buffer (length 100)
  • s_τ,i and a_τ,i correspond to the i-th observation and action
  • λ controls the strength of BC regularization and decays over time
  • 1_{i≥50} equals zero during the warm-up period (50 steps) and switches to one afterward
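A minimal PyTorch sketch of this objective is given below, assuming a stateful `actor` with a `reset_state` method and a simplified `critic(obs, action)` call (the action-history input of the actual critic is omitted for brevity); the sign is flipped because optimizers minimize.

```python
import torch

def actor_loss(actor, critic, obs_seq, act_seq, bc_weight, warmup=50):
    """Sequence-level actor loss: Q term minus decaying BC penalty, masked by the warm-up.
    obs_seq: (T, batch, obs_dim), act_seq: (T, batch, act_dim), with T = 100 here."""
    T = obs_seq.shape[0]
    actor.reset_state()                              # fresh membrane potentials per sequence
    q_term = obs_seq.new_zeros(())
    bc_term = obs_seq.new_zeros(())
    for i in range(T):
        a_pred = actor(obs_seq[i])                   # step-by-step stateful forward pass
        if i >= warmup:                              # 1_{i >= 50}: skip the warm-up period
            q_term = q_term + critic(obs_seq[i], a_pred).mean()
            bc_term = bc_term + ((a_pred - act_seq[i]) ** 2).sum(dim=-1).mean()
    return -(q_term - bc_weight * bc_term)           # negate: we maximize L_pi, optimizers minimize
```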

Spiking Actor-Critic Architecture

The spiking policy employs two hidden layers of Leaky Integrate-and-Fire (LIF) neurons. Each neuron receives an input current I_in, which incrementally charges its membrane potential U. Over time, the potential decays at a rate determined by the leak factor β.

When the membrane potential exceeds a defined threshold U_thr, the neuron emits a spike s and its potential is subsequently reset:

U[t+1] = β U[t] + I_in[t+1] - s[t] · U_thr

Where the spiking mechanism is described by:

s[t] = { 1, if U[t] > U_thr; 0, otherwise }
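The following is a minimal PyTorch sketch of one LIF layer under this update, with the spike check applied after integration and a subtractive reset; the leak β = 0.9 and threshold U_thr = 1.0 are illustrative defaults, and training through the hard threshold requires a surrogate gradient, the subject of Part 1.

```python
import torch

class LIFLayer(torch.nn.Module):
    """Minimal leaky integrate-and-fire layer:
    U <- beta * U + I_in, spike if U > U_thr, then subtract U_thr (subtractive reset)."""
    def __init__(self, beta=0.9, u_thr=1.0):
        super().__init__()
        self.beta = beta
        self.u_thr = u_thr
        self.u = None                                  # membrane potential, created lazily

    def reset_state(self):
        self.u = None                                  # forget state between sequences

    def forward(self, i_in):
        if self.u is None:
            self.u = torch.zeros_like(i_in)
        self.u = self.beta * self.u + i_in             # leak and integrate the input current
        spikes = (self.u > self.u_thr).float()         # s = 1 if U > U_thr, else 0
        self.u = self.u - spikes * self.u_thr          # reset by subtracting the threshold
        return spikes
```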

While the actor is a spiking network, the critic is a conventional artificial neural network (ANN) that receives the aforementioned states together with an action history of 32 timesteps. This asymmetric setup has been shown to leverage the improved training stability of ANNs while retaining the temporal processing capabilities of SNNs.
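For illustration, a critic along these lines could look like the sketch below; the hidden sizes and the 12-dimensional state / 4-motor action layout are assumptions based on the control task described later, not the paper's exact architecture.

```python
import torch

class HistoryCritic(torch.nn.Module):
    """Non-spiking critic: scores (state, action) pairs given a 32-step action history."""
    def __init__(self, obs_dim=12, act_dim=4, hist_len=32, hidden=256):
        super().__init__()
        in_dim = obs_dim + act_dim + hist_len * act_dim
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1),
        )

    def forward(self, obs, action, action_history):
        # action_history: (batch, 32, act_dim), flattened before concatenation
        x = torch.cat([obs, action, action_history.flatten(1)], dim=-1)
        return self.net(x)
```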

Micro Aerial Vehicle Control Task

We implement and evaluate our approach on a quadrotor position control task using the Crazyflie 2.1 platform. The control architecture employs a spiking neural network that processes a state vector comprising:

  • Position: (x, y, z) coordinates
  • Linear velocity: (v_x, v_y, v_z)
  • Orientation angles: (θ, φ, ψ)
  • Angular velocities: (p, q, r)

Based on these inputs, the SNN outputs a command for each of the four motors (m₁, m₂, m₃, m₄) at a control frequency of 100 Hz.
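A rough sketch of the resulting control interface is shown below; the helper names and the [0, 1] motor-command range are illustrative assumptions.

```python
import numpy as np

CTRL_FREQ = 100            # Hz
DT = 1.0 / CTRL_FREQ       # 10 ms between motor commands

def build_observation(pos, vel, att, ang_vel):
    """Assemble the 12-D state: [x, y, z, v_x, v_y, v_z, θ, φ, ψ, p, q, r]."""
    return np.concatenate([pos, vel, att, ang_vel]).astype(np.float32)

def control_step(policy, pos, vel, att, ang_vel):
    """One 100 Hz control step: state estimate in, four motor commands out."""
    obs = build_observation(pos, vel, att, ang_vel)
    m1, m2, m3, m4 = np.clip(policy.act(obs), 0.0, 1.0)   # bounded motor commands
    return m1, m2, m3, m4
```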

Crazyflie coordinate system

Figure 2: The Crazyflie quadrotor platform and coordinate system used for position control.

Encoding and Decoding

SNNs operate in a spike-based domain, so continuous values must be encoded into spikes and spikes decoded back into continuous outputs. We use population-based encoding and decoding: one linear layer encodes the input values into input currents, and another linear layer decodes the output spikes into motor commands.
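Put together with the LIFLayer sketched earlier, a two-hidden-layer spiking actor with linear encoding and decoding could look roughly like this; the layer widths and the tanh output bound are our assumptions.

```python
import torch

class SpikingActor(torch.nn.Module):
    """Linear encoder -> two LIF layers -> linear decoder (LIFLayer as sketched above)."""
    def __init__(self, obs_dim=12, act_dim=4, hidden=256):
        super().__init__()
        self.encode = torch.nn.Linear(obs_dim, hidden)    # continuous values -> input currents
        self.lif1 = LIFLayer()
        self.fc = torch.nn.Linear(hidden, hidden)
        self.lif2 = LIFLayer()
        self.decode = torch.nn.Linear(hidden, act_dim)    # output spikes -> motor commands

    def reset_state(self):
        self.lif1.reset_state()
        self.lif2.reset_state()

    def forward(self, obs):
        s1 = self.lif1(self.encode(obs))
        s2 = self.lif2(self.fc(s1))
        return torch.tanh(self.decode(s2))                # bounded motor commands (assumption)
```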

Training on Sequences

To fully leverage the temporal processing capabilities of SNNs in reinforcement learning, we extend training from single transitions to full sequences and introduce a reward curriculum that gradually increases penalties on position, velocity, and action magnitude to encourage stable, robust behavior.
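As an illustration of such a curriculum, the sketch below ramps the penalty weights with training progress; the coefficients and schedule are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def curriculum_reward(pos_err, vel, action, progress):
    """Shaped reward whose penalties grow as training progresses (progress in [0, 1])."""
    w_pos = 0.5 + 0.5 * progress      # position penalty grows over training
    w_vel = 0.1 * progress            # velocity penalty introduced gradually
    w_act = 0.05 * progress           # action-magnitude penalty introduced gradually
    return -(w_pos * np.linalg.norm(pos_err)
             + w_vel * np.linalg.norm(vel)
             + w_act * np.linalg.norm(action))
```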

Training comparison across methods

Figure 3: Offline methods (BC and TD3BC) struggle as the reward function adapts, failing to generalize beyond the initial dataset.

Sequence length comparison

Figure 4: While all methods eventually learn to fly, TD3-trained policies frequently terminate early due to unstable exploration, whereas TD3BC+JSRL achieves longer and more stable flight trajectories.

Comparison with Baseline Methods

We compare our TD3BC+JSRL approach with several baseline methods:

  • Behavioral Cloning (BC): Cannot adapt to the changing reward curriculum, since it only imitates the fixed dataset
  • TD3BC: Initially shows ability to leverage reward information but performance drops as the reward function diverges from the dataset
  • TD3: Struggles to gather meaningful sequences early in training, often resulting in premature episode terminations
  • TD3BC+JSRL: Demonstrates robust learning even under challenging curriculum, with substantial performance improvements

Bridging the Reality Gap

We quantitatively evaluated the computational efficiency of our sequential SNN approach using NeuroBench. Results show that temporally-trained SNNs match ANN performance while exhibiting distinct computational traits.

Real-world deployment on Crazyflie

Figure 5: When deployed on the Crazyflie, the spiking actor displays oscillatory behavior but can successfully fly maneuvers such as circles.

Computational Efficiency Analysis

Despite a higher memory footprint, SNNs benefit from activation sparsity and primarily use energy-efficient accumulates (ACs) instead of energy-hungry multiply-accumulates (MACs), making them well-suited for neuromorphic deployment.
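A back-of-the-envelope comparison makes this concrete: in a fully connected layer, a dense ANN performs one MAC per weight every step, while a spiking layer only accumulates the weights of inputs that actually fired. The function below is an illustrative estimate, not NeuroBench's actual accounting.

```python
def synaptic_ops(fan_in, fan_out, spike_rate):
    """Rough per-timestep operation counts for one fully connected layer."""
    ann_macs = fan_in * fan_out                    # dense multiply-accumulates
    snn_acs = int(spike_rate * fan_in) * fan_out   # only spiking inputs contribute accumulates
    return ann_macs, snn_acs

# Example: 256 -> 256 layer with 10% of input neurons spiking per step
macs, acs = synaptic_ops(256, 256, 0.10)
print(f"ANN MACs: {macs}, SNN ACs: {acs}")         # 65536 vs 6400
```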

Real-World Deployment Results

When deployed on the Crazyflie, the SNN controller exhibits oscillatory behavior, likely due to limited output resolution and the lack of an explicit action history. However, the SNN is still able to execute complex maneuvers like circles, figure eights, and squares.

We deployed the trained SNNs on Crazyflie variants with standard and modified motor-propeller setups. Compared to ANN controllers, SNNs achieved lower position error under ideal conditions but had reduced reliability. Notably, while ANNs without action history failed to control the drone, our SNN leveraged temporal dynamics to maintain control, albeit with occasional crashes.

Key Contributions and Impact

This work makes several significant contributions to the field of neuromorphic reinforcement learning:

  • Novel RL Algorithm: Introduces TD3BC+JSRL specifically designed for spiking neural networks
  • Warm-Up Period Solution: Addresses the critical challenge of bridging the warm-up period in sequential training
  • Real-World Validation: Successfully demonstrates sim-to-real transfer without observation history augmentation
  • Computational Efficiency: Shows that SNNs can match ANN performance while offering energy efficiency advantages

Future Research Directions

Several promising avenues for future work have been identified:

  • Improved Stability: Incorporating angular velocity penalties, throttle deviation outputs, or increased control frequency
  • Enhanced Encoding: Developing more sophisticated encoding and decoding methods for continuous control
  • Multi-Agent Systems: Extending the approach to multi-agent scenarios and swarm robotics
  • Hardware Optimization: Optimizing the algorithm for specific neuromorphic hardware platforms

Conclusion

This research successfully addresses the critical challenges of training spiking neural networks for real-world control tasks. By introducing a novel reinforcement learning algorithm that explicitly leverages the temporal dynamics of SNNs and providing a solution to the warm-up period problem, we demonstrate that spiking neural networks can be effectively trained for complex control tasks and successfully deployed in real-world robotic systems.

The work contributes to the broader goal of making neuromorphic computing more accessible and effective for real-world applications, particularly in resource-constrained environments where energy efficiency is critical. The successful sim-to-real transfer without observation history augmentation represents a significant step forward in the field of neuromorphic robotics.