This work builds upon our comparative study of spiking and non-spiking neural networks in actor-critic reinforcement learning, extending the research to real-world control applications. While the initial study focused on the CartPole task to establish fundamental principles, this follow-up work demonstrates the practical application of spiking neural networks for autonomous drone landing control.
Building on the A2C Comparative Study
Our previous work established key insights about training spiking neural networks with the A2C algorithm, including the effectiveness of surrogate gradients, the importance of temporal dynamics, and the potential for aggressive pruning. This research extends those findings to a more complex, real-world control task: autonomous landing of micro aerial vehicles (MAVs).
The key advancements from the comparative study that enabled this work include:
- Sequence-based Training: Leveraging the A2C algorithm's ability to train on full interaction sequences rather than single state transitions
- Surrogate Gradient Methods: Using backpropagation through time with surrogate gradients to train spiking networks effectively (a minimal sketch follows this list)
- Temporal Dynamics: Understanding how leaky integrate-and-fire neurons can learn temporal relationships
- Pruning Strategies: Applying the aggressive pruning techniques developed in the comparative study
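To make the surrogate-gradient idea concrete, here is a minimal PyTorch sketch of a spike function whose backward pass substitutes a smooth fast-sigmoid derivative for the non-differentiable threshold. The class name, threshold, and slope values are illustrative assumptions, not the exact implementation used in this work.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike with a fast-sigmoid surrogate gradient (illustrative).

    Forward: emit a spike when the membrane potential crosses the threshold.
    Backward: replace the non-differentiable step with a smooth derivative
    so backpropagation through time can assign credit through spikes.
    """

    @staticmethod
    def forward(ctx, mem, threshold=1.0, slope=25.0):
        ctx.save_for_backward(mem)
        ctx.threshold, ctx.slope = threshold, slope
        return (mem >= threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        (mem,) = ctx.saved_tensors
        # Fast-sigmoid surrogate: d(spike)/d(mem) ~ 1 / (slope*|mem - thr| + 1)^2
        surrogate = 1.0 / (ctx.slope * (mem - ctx.threshold).abs() + 1.0) ** 2
        return grad_output * surrogate, None, None

spike_fn = SurrogateSpike.apply
```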
Project Overview: Drone Landing Control
The primary objective was to develop a complete pipeline for training SNN-based controllers that could successfully land a MAV using only sonar altitude readings. This required addressing several key challenges that go beyond the simpler CartPole task:
- Temporal Information Processing: SNNs operate on spike sequences over time, necessitating adaptations to conventional transition-based RL algorithms
- Velocity Estimation: The network needed to learn internal representations of velocity from altitude sequences
- Real-world Deployment: The controller had to generalize from simulation to physical hardware
- Safety Constraints: Ensuring safe landing velocities while maintaining efficiency
Technical Implementation
The project employed an actor-critic training pipeline with several key components:
Network Architecture
The controller consisted of two hidden layers of 32 leaky integrate-and-fire (LIF) neurons each, connected by linear feedforward layers. The sonar altitude input was injected directly into the first layer as an input current, without spike encoding, while the output layer comprised 7 neurons corresponding to discrete acceleration commands. A softmax over the output membrane potentials produced the action distribution.
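As a rough illustration of this architecture, the following PyTorch sketch wires two hidden LIF layers of 32 neurons between linear layers, injects the raw altitude reading as input current, and reads out a softmax over integrated output potentials. The decay constants, reset scheme, and layer names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LIFLayer(nn.Module):
    """Leaky integrate-and-fire layer (minimal sketch; beta is an assumed decay)."""
    def __init__(self, beta=0.9, threshold=1.0):
        super().__init__()
        self.beta, self.threshold = beta, threshold

    def forward(self, current, mem):
        mem = self.beta * mem + current          # leaky integration of input current
        spk = (mem >= self.threshold).float()    # hard threshold shown for clarity;
                                                 # training would use a surrogate spike fn
        mem = mem - spk * self.threshold         # soft reset by subtraction
        return spk, mem

class LandingController(nn.Module):
    """Actor head: sonar altitude in, 7 discrete acceleration commands out."""
    def __init__(self, n_hidden=32, n_actions=7):
        super().__init__()
        self.fc1 = nn.Linear(1, n_hidden)
        self.lif1 = LIFLayer()
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.lif2 = LIFLayer()
        self.fc_out = nn.Linear(n_hidden, n_actions)

    def forward(self, altitude_seq):
        # altitude_seq: (T, batch, 1); the raw reading acts as input current.
        mem1 = mem2 = out_mem = 0.0
        for alt in altitude_seq:
            spk1, mem1 = self.lif1(self.fc1(alt), mem1)
            spk2, mem2 = self.lif2(self.fc2(spk1), mem2)
            out_mem = 0.9 * out_mem + self.fc_out(spk2)  # non-spiking integrator readout
        return torch.softmax(out_mem, dim=-1)            # action distribution
```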
Figure 1: Training performance comparison between SNN and ANN architectures. The SNN shows different convergence characteristics due to its spiking nature and temporal dynamics.
Pre-training Strategy
A critical insight was the necessity of supervised pre-training to help the SNN develop meaningful internal representations. Without proper guidance, the network struggled to learn velocity representations crucial for safe landings. The pre-training task involved learning velocity from sonar readings, which significantly improved training efficiency and reduced computational requirements by an order of magnitude.
Figure 2: Training performance comparison across different model configurations, showing the impact of pre-training and network architecture choices.
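A hypothetical outline of such a pre-training loop: regress a velocity target from altitude sequences with a mean-squared-error loss before any reinforcement learning begins. The `model` velocity readout, data shapes, and hyperparameters are assumed here.

```python
import torch

# Hypothetical pre-training loop: teach the hidden layers to carry a velocity
# estimate before reinforcement learning. `model` is assumed to expose a
# velocity readout head; shapes and hyperparameters are illustrative.
def pretrain_velocity(model, altitude_seqs, velocity_targets, epochs=50, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        v_hat = model(altitude_seqs)            # predicted velocity per time step
        loss = loss_fn(v_hat, velocity_targets)
        loss.backward()                         # BPTT through the spiking dynamics
        optimizer.step()
    return model
```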
Reinforcement Learning Algorithms
Two spiking reinforcement learning algorithms were implemented and compared:
- Spiking A2C (Advantage Actor-Critic): Adapted for the spiking domain, avoiding the shuffling and batching of interactions
- Spiking DQN: Modified to learn from interaction sequences instead of single state transitions
The A2C approach proved significantly more effective than value function methods, confirming findings from previous research on temporal information processing in reinforcement learning.
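As a sketch of what training on full interaction sequences looks like, the following computes an A2C loss over one complete episode, with log-probabilities and value estimates collected step by step while the SNN state evolves. The discount factor and loss weighting are illustrative, not the exact values used here.

```python
import torch

def a2c_episode_loss(log_probs, values, rewards, gamma=0.99):
    """A2C loss over one full episode, keeping the sequence intact.

    log_probs, values: 1-D tensors collected step-by-step as the SNN state
    evolved; rewards: list of scalar rewards. Illustrative sketch only.
    """
    returns, R = [], 0.0
    for r in reversed(rewards):               # discounted returns, computed backwards
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    advantages = returns - values.detach()    # critic baseline reduces variance
    actor_loss = -(log_probs * advantages).mean()
    critic_loss = (returns - values).pow(2).mean()
    return actor_loss + 0.5 * critic_loss
```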
Simulation and Training Environment
A lightweight simulator was constructed with second-order dynamics described by:
ḣ(t) = ḣ(t-1) + ḧ(t) · Δt
h(t) = h(t-1) + ḣ(t) · Δt
where h represents altitude, ḣ is vertical velocity, and ḧ is the vertical acceleration command after low-pass filtering. The simulator varied starting altitude and velocity across episodes to enhance controller robustness.
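These update equations are plain Euler integration and translate directly into a simulator step; below is a minimal sketch, with the low-pass filter constant and time step as assumed values.

```python
class LandingSim:
    """Minimal second-order altitude simulator matching the update equations.

    The commanded acceleration is low-pass filtered before integration;
    `alpha` (filter constant) and `dt` are assumed values.
    """
    def __init__(self, h0=5.0, v0=0.0, dt=0.02, alpha=0.9):
        self.h, self.v, self.a = h0, v0, 0.0
        self.dt, self.alpha = dt, alpha

    def step(self, a_cmd):
        self.a = self.alpha * self.a + (1 - self.alpha) * a_cmd  # low-pass filter
        self.v = self.v + self.a * self.dt                       # hdot(t) update
        self.h = self.h + self.v * self.dt                       # h(t) update
        return self.h  # the controller observes only (sonar) altitude
```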
Results and Performance
The trained SNN controller generalized well across three environments: the basic simulator, the PaparazziUAV framework, and a physical MAV.
Basic Simulation Performance
In the simple simulator, the network exhibited intelligent landing behavior. When starting with non-zero velocity, it promptly issued positive thrust commands to reduce downward velocity, progressively increasing thrust as the MAV approached the ground. This demonstrated the network's ability to modulate thrust based on proximity to the landing surface and its internal velocity representation.
Figure 3: Performance of the SNN-based controller in the basic simulation model. The controller shows intelligent landing behavior with appropriate thrust modulation.
PaparazziUAV Integration
The controller was successfully integrated into the PaparazziUAV framework, with acceleration commands mapped to throttle settings for the Parrot Bebop 2 MAV. The model was converted to C code and wrapped in a PaparazziUAV-compatible module, enabling deployment on real hardware.
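The mapping from discrete network actions to throttle might look like the following sketch; the acceleration setpoints, hover throttle, and linearization around hover are assumptions for illustration, not the actual PaparazziUAV module code.

```python
# Illustrative action-to-throttle mapping (not the actual PaparazziUAV module;
# setpoints, hover throttle, and the linear model are assumed).
ACCELERATIONS = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]  # m/s^2 setpoints
G = 9.81
HOVER_THROTTLE = 0.5  # assumed throttle fraction that holds a hover

def action_to_throttle(action_idx):
    a_cmd = ACCELERATIONS[action_idx]
    # Approximate linear mapping around hover: thrust ~ m * (g + a_cmd).
    throttle = HOVER_THROTTLE * (1.0 + a_cmd / G)
    return min(max(throttle, 0.0), 1.0)      # clamp to valid throttle range
```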
Real-world Deployment
Most significantly, the same network deployed on a physical MAV achieved successful landings without additional tuning or training. While slight instabilities occurred that weren't present in simulation (likely due to imperfect tuning of underlying stabilization controllers), the controller demonstrated robust real-world performance.
Figure 4: The SNN controller successfully deployed on a physical Parrot Bebop 2 MAV, demonstrating real-world landing capabilities.
ICNCE 2024 Conference Presentation
This work was presented at the International Conference on Neuromorphic Computing and Engineering (ICNCE) 2024 in Aachen, Germany. The conference presentation highlighted the novel contributions in on-policy reinforcement learning for spiking neural networks and the successful real-world deployment of neuromorphic controllers.
The poster presentation emphasized several key aspects of the research:
NeuroBench Benchmarking Results
Using the NeuroBench framework, we demonstrated the computational efficiency advantages of spiking neural networks:
- Activation Sparsity: SNNs achieved 68-92% activation sparsity compared to 0% for ANNs (see the sketch after this list)
- Synaptic Operations: Because spiking layers emit sparse binary events, downstream synaptic operations reduce to accumulates (ACs) rather than multiply-accumulates (MACs)
- Pruning Effectiveness: SNNs could be pruned by 90%+ without significant performance degradation
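As a hand-rolled stand-in for how an activation-sparsity figure like the above can be computed (NeuroBench provides its own harness for this), one can count the fraction of zero activations across recorded spike trains:

```python
import torch

def activation_sparsity(spike_trains):
    """Fraction of zero activations across layers and time steps.

    spike_trains: list of tensors, one per layer, shaped (T, batch, neurons).
    A hand-rolled stand-in for the NeuroBench activation-sparsity metric.
    """
    total = sum(s.numel() for s in spike_trains)
    zeros = sum((s == 0).sum().item() for s in spike_trains)
    return zeros / total
```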
Figure 5: Noise robustness analysis showing SNNs with temporal dynamics outperform traditional ANNs under noisy conditions.
Noise Robustness Analysis
The conference presentation highlighted the superior noise robustness of spiking neural networks, particularly those with temporal dynamics (leaky neurons). At noise levels above 0.04, the leaky SNN models surpassed ANN performance, and above 0.15 they also outperformed their non-leaky counterparts.
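A noise-robustness sweep of this kind can be run by adding Gaussian noise to the sonar observation at increasing standard deviations; in this sketch, `evaluate_policy` is an assumed helper returning the mean episode return.

```python
import torch

# Illustrative noise-robustness sweep: additive Gaussian noise on the sonar
# observation. `evaluate_policy` is an assumed helper returning mean return.
def noise_sweep(model, env, levels=(0.0, 0.04, 0.15, 0.3), episodes=20):
    results = {}
    for sigma in levels:
        noisy_obs = lambda h: h + sigma * torch.randn(()).item()
        results[sigma] = evaluate_policy(model, env, obs_transform=noisy_obs,
                                         episodes=episodes)
    return results
```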
Key Findings and Challenges
The research revealed several important insights:
- Pre-training Necessity: SNNs required supervised pre-training to develop velocity representations
- Training Efficiency: SNN training was significantly slower than training comparable artificial neural networks
- Neuron Efficiency: The trained SNN exhibited numerous dead and saturated neurons, indicating training inefficiency (a diagnostic sketch follows Figure 6)
- Action Discretization: The discrete output actions were suboptimal for precise control tasks
- Real-world Generalization: The same network successfully deployed on physical hardware without additional tuning
Figure 6: Distribution of neuron activity in the trained SNN, showing the challenge of dead and saturated neurons that limit training efficiency.
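A simple diagnostic for this issue is to compute per-neuron firing rates over recorded spike trains and count the neurons that never fire or fire on nearly every time step; the saturation threshold below is an assumed value.

```python
import torch

def neuron_activity_report(spikes, saturated_above=0.95):
    """Flag dead and saturated neurons from recorded spike trains.

    spikes: tensor of shape (T, neurons) with binary entries.
    The saturation threshold is an assumed value for illustration.
    """
    rates = spikes.float().mean(dim=0)                    # per-neuron firing rate
    dead = (rates == 0).sum().item()                      # never fired
    saturated = (rates >= saturated_above).sum().item()   # fired almost every step
    return {"dead": dead, "saturated": saturated, "total": rates.numel()}
```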
Future Research Directions
Several promising avenues for future work have been identified:
- Recurrent Replay Buffers: Investigating R2D2-style recurrent replay for off-policy policy-gradient algorithms
- Asymmetric Actor-Critic: Combining SNN actors with ANN critics for faster, more stable training
- Stochastic Neuron Models: Exploring models that become more deterministic during training
- Inherent Exploration: Encoding exploration-exploitation balance directly in SNN agents
- Continuous Action Spaces: Developing methods for continuous control outputs rather than discrete actions
Impact and Significance
This work represents a significant step forward in neuromorphic control systems, demonstrating that SNNs can be effectively trained for real-world robotics applications. The successful deployment on physical hardware validates the potential of neuromorphic computing for autonomous systems, while the identified challenges provide clear directions for future research.
The project contributes to the growing body of work on neuromorphic MAV control, building upon previous research in event-driven control and energy-efficient autonomous systems. The findings have implications for the development of next-generation autonomous vehicles and robotics systems that require both computational efficiency and real-time performance.
The conference presentation at ICNCE 2024 helped disseminate these findings to the broader neuromorphic computing community, fostering collaboration and advancing the field toward practical applications of spiking neural networks in autonomous systems.
Code and Resources
The implementation and training code for this project is available on GitHub: https://github.com/korneelf1/SpikingA2C
For more information about the NeuroBench benchmarking framework used in this research, visit: https://neurobench.ai