SpectraLDS: Provable Distillation for Linear Dynamical Systems

Devan Shah 1 Shlomo Fortgang 1 Sofiia Druchnya 1 Elad Hazan 1,2
1 Princeton University 2 Google DeepMind Princeton

[Figure: spectral filters, LDS impulse responses, and the reconstructed spectral filters]

Abstract

We present the first provable method for identifying symmetric linear dynamical systems (LDS) with accuracy guarantees that are independent of the system's state dimension or effective memory. Our approach builds upon recent work that represents symmetric LDSs as convolutions learnable via fixed spectral transfor- mations. We show how to invert this representation—recovering an LDS model from its spectral transform—yielding an end-to-end convex optimization procedure. This distillation preserves predictive accuracy while enabling constant-time and constant-space inference per token, independent of sequence length. We evaluate our method, SpectraLDS, as a component in sequence prediction architectures and demonstrate that accuracy is preserved while inference efficiency is improved on tasks such as language modeling.

Background: The Inference Speed Challenge

1 The Quadratic Bottleneck

Transformer architectures suffer from quadratic complexity in sequence length due to self-attention, making them computationally expensive for long sequences. This has driven research toward more efficient alternatives that preserve expressiveness while reducing computational cost. To generate K tokens from a prompt of length T, attention requires:

$$\text{Attention: } O(T^2 + TK + K^2) \text{ operations}$$

2 State-Space Model Promise

Linear Dynamical Systems and state-space models offer attractive linear-time recurrent inference, but traditional gradient-based training suffers from instability due to vanishing/exploding gradients, especially for systems requiring long-term memory. For an RNN/LDS with state dimension h, we require:

$$\text{RNN/LDS: } O(h \cdot (T + K)) \text{ operations}$$

3 Spectral Methods Trade-off

Recent spectral filtering approaches like the STU provide stable convex training for LDS-like systems, achieving impressive results on long-context tasks. However, they still require expensive convolutions at inference time, even with specialized algorithms such as FutureFill.

$$\text{STU: } O(T \log T + K \log^3 K) \text{ operations}$$

What we offer!

SpectraLDS bridges this gap by enabling stable spectral training followed by distillation to efficient LDS inference. We achieve the training robustness of spectral methods with the inference speed of recurrent models, reducing generation complexity from O(L log³ L) to O(L) for a length-L sequence, i.e. constant time per token, while maintaining accuracy.

| Method | Training Stability | Inference Speed | Memory Usage |
|---|---|---|---|
| Transformers | ✓ Stable | ✗ O(L²) | ✗ O(L) |
| Direct LDS | ✗ Unstable | ✓ O(L) | ✓ O(h) |
| STU | ✓ Stable | ~ O(L log³ L) | ✗ O(L) |
| SpectraLDS | ✓ Stable | ✓ O(L) | ✓ O(h) |

Our Approach: SpectraLDS Distillation

1. Linear Dynamical Systems Primer

Linear Dynamical Systems (LDS) serve as the fundamental building block for state-space models used in sequence modeling. An LDS maintains a hidden state that evolves over time, enabling efficient inference while capturing long-range dependencies. However, direct gradient-based learning suffers from exploding gradients for systems with high effective memory.

$$x_t = Ax_{t-1} + Bu_t, \quad \hat{y}_t = Cx_t + Du_t$$
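As a concrete illustration, the recurrence above takes only a few lines (a minimal NumPy sketch; the function name, dimensions, and random parameters are illustrative, not taken from the paper's code):

```python
import numpy as np

def lds_step(A, B, C, D, x, u):
    """One step of x_t = A x_{t-1} + B u_t,  y_t = C x_t + D u_t."""
    x = A @ x + B @ u
    y = C @ x + D @ u
    return x, y

rng = np.random.default_rng(0)
h, d = 4, 2                              # state / input-output dimensions (illustrative)
A = np.diag(rng.uniform(-0.9, 0.9, h))   # symmetric (here diagonal) transition matrix
B = rng.standard_normal((h, d))
C = rng.standard_normal((d, h))
D = rng.standard_normal((d, d))

x = np.zeros(h)
for t in range(5):
    x, y = lds_step(A, B, C, D, x, rng.standard_normal(d))
```

Each step costs O(h·d) time and O(h) memory, independent of sequence length, which is what makes recurrent inference attractive.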

2. LDS as Convolutions

Any LDS can be expanded into an equivalent convolutional form by unrolling the recurrence relation. This reveals that the LDS output is actually a convolution between the input sequence and the system's impulse response, where each output is a weighted sum of past inputs.

$$y_t = \sum_{i=0}^{t-1} CA^iB \cdot u_{t-i} + Du_t$$ $$= \langle CA^0B, CA^1B, \ldots \rangle * u_{1:t}$$
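The equivalence of the recurrent and convolutional forms is easy to check numerically (a sketch assuming scalar inputs and outputs with D = 0; not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
h, T = 3, 20
A = np.diag(rng.uniform(-0.9, 0.9, h))   # symmetric (diagonal) transition matrix
B = rng.standard_normal((h, 1))
C = rng.standard_normal((1, h))
u = rng.standard_normal(T)

# Recurrent form: x_t = A x_{t-1} + B u_t,  y_t = C x_t
x = np.zeros(h)
y_rec = []
for t in range(T):
    x = A @ x + B[:, 0] * u[t]
    y_rec.append(C @ x)
y_rec = np.array(y_rec).ravel()

# Convolutional form: impulse response k_i = C A^i B
kernel = np.array([(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(T)])
y_conv = np.array([sum(kernel[i] * u[t - i] for i in range(t + 1)) for t in range(T)])

assert np.allclose(y_rec, y_conv)
```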

3. Spectral Basis Representation

The Spectral Transform Unit (STU) represents LDS-like dynamics using coefficients on fixed convolutional filters rather than learning the transition matrix A directly. Any LDS with symmetric A can be approximated using a combination of spectral basis functions, providing stable training with exponential approximation guarantees.

$$y_t^{SF} = \sum_{j=1}^{k} M_{\varphi_j}^+ U_{t-2,j}^+ + \sum_{j=1}^{k} M_{\varphi_j}^- U_{t-2,j}^-$$
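In the spectral filtering literature, the fixed filters φ_j are the top-k eigenvectors of a particular positive semidefinite Hankel matrix; the sketch below follows that construction (the matrix entries come from the published spectral filtering papers and are an assumption here, not stated on this poster):

```python
import numpy as np

L, k = 256, 24                        # context length, number of filters (illustrative)
idx = np.arange(1, L + 1)
s = idx[:, None] + idx[None, :]       # i + j, ranging over 2..2L
Z = 2.0 / (s**3 - s)                  # Hankel matrix from the spectral filtering papers
eigvals, eigvecs = np.linalg.eigh(Z)  # eigenvalues in ascending order
filters = eigvecs[:, ::-1][:, :k]     # top-k eigenvectors = spectral filters
```

The eigenvalues of Z decay exponentially, which is why a small number k of filters suffices for the approximation guarantee.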

4. Spectral to LDS Distillation

Our key insight is that each spectral filter can be approximated as a linear combination of LDS impulse responses. This enables converting the learned spectral representation back into explicit LDS form with provable bounded approximation error, bridging stable spectral training with efficient recurrent inference.

$$\|\Phi_{1:k} - M_f \mu_L(\alpha_1,\ldots,\alpha_h)\| \leq c \lambda_{\max} h e^{-k/\log L}$$

5. Direct Architectural Substitution

The distillation enables replacing STU layers with equivalent LDS layers that maintain the same accuracy while achieving constant-time per-token generation. This provides the training stability of spectral methods with the inference efficiency of recurrent models, reducing generation complexity from O(L log³ L) to O(L) for a length-L sequence.

$$U_{t,j}^+ \approx M_{fj} \sum_{i=0}^{t-1} A^i \vec{1} u_{t-i}^{\top} = \text{LDS}(M_{fj}, A, \vec{1})$$
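After substitution, generation reduces to an O(h) recurrent update per token; the sketch below checks that a diagonal LDS of the form above reproduces its own convolutional kernel (the eigenvalues and output weights are placeholders, not distilled values):

```python
import numpy as np

h, T = 80, 64
alphas = np.linspace(-0.99, 0.99, h)  # placeholder eigenvalues
c = np.ones(h) / h                    # placeholder output weights from distillation
u = np.random.default_rng(2).standard_normal(T)

# Recurrent inference: O(h) time and memory per token
x = np.zeros(h)
y_rec = []
for u_t in u:
    x = alphas * x + u_t              # x_t = A x_{t-1} + 1 * u_t with diagonal A
    y_rec.append(c @ x)
y_rec = np.array(y_rec)

# Equivalent convolution with kernel k_i = sum_j c_j alpha_j^i
kernel = (c[None, :] * alphas[None, :] ** np.arange(T)[:, None]).sum(axis=1)
y_conv = np.array([sum(kernel[i] * u[t - i] for i in range(t + 1)) for t in range(T)])
assert np.allclose(y_rec, y_conv)
```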

Empirical Results

Online Learning of Symmetric LDS

SpectraLDS outperforms existing methods on synthetic LDS learning tasks with and without noise. Our approach demonstrates superior sample efficiency and reconstruction accuracy compared to direct LDS training, attention-based methods, and other baselines. The shaded regions show 95% confidence intervals over 8 runs, highlighting the consistent performance advantages of our spectral distillation approach across different noise conditions.


Constant-Time Generation: From Quadratic to Linear Scaling

SpectraLDS achieves linear scaling in sequence length:
The naive convolution approach exhibits quadratic growth, FutureFill variants grow as O(L log³ L), while our distilled STU-to-LDS layers maintain linear growth. SpectraLDS also maintains nearly identical runtime regardless of LDS state dimension, indicating that LDS operations are not a computational bottleneck.


Performance of FlashSTU after LDS-for-STU substitution

| Model | MMLU | HellaSwag | PIQA | BoolQ | Winogrande | CSQA | OBQA | ARC-e | ARC-c | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Flash STU 340M | 26.58 | 30.46 | 65.34 | 60.12 | 51.85 | 20.48 | 20.60 | 54.08 | 23.29 | 39.20 |
| SpectraLDS 340M | 26.58 | 30.45 | 65.29 | 60.12 | 50.99 | 20.15 | 20.20 | 54.17 | 23.29 | 39.03 |
| Transformer 340M | 26.81 | 30.41 | 64.64 | 61.10 | 51.62 | 19.98 | 18.80 | 55.47 | 21.84 | 38.96 |

Language model performance remains essentially unchanged after distillation. Despite converting convolution-based spectral filters into an explicit LDS formulation, performance across multiple language benchmarks is nearly identical, demonstrating that our distillation preserves model accuracy while achieving significant inference speedups.


Spectral Filter Reconstruction Quality

High-fidelity reconstruction of spectral filters.
Our distillation algorithm accurately reconstructs the original spectral filters using an LDS with state dimension 80. The visualization shows both positive spectral filters (red) and their LDS approximations (blue), demonstrating excellent agreement across all 24 filters.

Reconstruction error: The LDS achieves a reconstruction MSE of 1.23 × 10⁻¹², confirming that our theoretical guarantees translate to practical high-precision approximations.

Summary of Key Empirical Findings

  • SpectraLDS achieves the first provable distillation from spectral filters to LDS with bounded approximation error, maintaining identical accuracy while reducing inference complexity.
  • Our method enables constant-time per-token generation (O(L) vs O(L log³ L)) for LDS models, providing significant speedups for long sequence generation without sacrificing model performance.
  • Language model benchmarks demonstrate that distilled models maintain performance across multiple evaluation tasks.
  • Spectral filter reconstruction accurately recovers the original filters, validating our theoretical approximation bounds in practice.
  • The distillation enables hybrid architectures that combine stable spectral training with efficient recurrent inference, bridging the gap between training robustness and inference speed.
  • Our approach scales independently of LDS state dimension and effective memory, making it feasible for challenging sequence-to-sequence prediction tasks.

${\bf B\kern-.05em{\small I\kern-.025em B}\kern-.08em T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}$

@article{shah2025spectralds,
  title={SpectraLDS: Provable Distillation for Linear Dynamical Systems},
  author={Devan Shah and Shlomo Fortgang and Sofiia Druchyna and Elad Hazan},
  year={2025},
  url={https://arxiv.org/abs/2505.17868},
}

This template was largely forked from Kevin Wang's research site.