What I’m Working On: Speech Representation Learning (Early Research Notes)

This post summarizes my current independent research direction in speech representation learning. It is intentionally high-level and focuses on the research question, modeling choices, and evaluation plan rather than results.

TL;DR

  • I am exploring continuous-latent autoencoder training for speech representations, with joint objectives for semantic structure and reconstruction.
  • The goal is to improve on strong baselines that already unify representation learning and generation (e.g., UniWav), in terms of both performance and resource efficiency. [3]
  • A core hypothesis is that geometric regularization can substitute for KL-style regularization in an autoencoder-style framework, improving stability and downstream discriminability.

Research direction

The current landscape includes self-supervised representation learners such as wav2vec 2.0 and HuBERT, as well as more recent SSL frameworks optimized for speed and scalability. [2][10][9] There is also increasing work on general-purpose audio encoders (e.g., OpenBEATs) and unified SSL frameworks for speech/audio. [7][6]

There are also “fast Conformer” lines of work that position efficient encoders as broadly useful building blocks across speech tasks (e.g., NEST). [15][12][4]

Separately, neural audio codecs have shown that high-fidelity reconstruction is achievable with compact representations (e.g., EnCodec), and foundation models such as Moshi build on this class of ideas for speech-text interaction. [11][8]

Continuous audio language modeling is another relevant direction: CALM formalizes generative modeling over continuous audio representations and serves as a useful reference point for thinking about continuous-latent objectives and codec-like bottlenecks in modern speech systems. [16][11][8]

Recent work has also shown that it is possible to unify representation learning and generation within a single model (e.g., UniWav). [3] My focus is not the existence of this unification, but whether an autoencoder-centric formulation can improve the baseline—particularly under compute and data constraints, and in lower-resource settings.

Concretely, I am testing a continuous-latent autoencoder approach that aims to (a rough sketch of the combined objective follows the list):

  • improve semantic discriminability while retaining reconstruction fidelity,
  • remain resource-efficient in lower-resource settings (including experiments targeting Bengali speech),
  • provide a more stable regularization strategy than standard VAE-style KL divergence (see hypothesis below).
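
As a rough illustration of how these aims translate into training, the sketch below combines a reconstruction term, a semantic term, and a latent regularizer as a weighted sum. The function and weight names are my own placeholders rather than settled design decisions, and reg_fn stands for whichever latent regularizer is used (see the hypothesis section below).

```python
# Illustrative combined objective (assumed names and weights; the actual terms are still under exploration).
import torch
import torch.nn.functional as F

def joint_loss(recon, target, z, z_teacher, reg_fn,
               w_rec: float = 1.0, w_sem: float = 1.0, w_reg: float = 0.1):
    # Reconstruction: keep the latents informative enough to rebuild the input features.
    l_rec = F.l1_loss(recon, target)
    # Semantic: pull latents toward targets that carry semantic structure
    # (e.g., frame-level teacher features); cosine distance is one simple choice.
    l_sem = 1.0 - F.cosine_similarity(z, z_teacher, dim=-1).mean()
    # Regularization: a constraint on the latent distribution
    # (KL-style or geometric; see the SigReg-vs-KL hypothesis below).
    l_reg = reg_fn(z)
    return w_rec * l_rec + w_sem * l_sem + w_reg * l_reg
```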

Proposed approach

The working design includes the following components (a structural sketch follows the list):

  • an encoder backbone inspired by Zipformer (with modifications for efficiency), [1]
  • architectural mechanisms to preserve information flow through the bottleneck (to support reconstruction),
  • a semantic objective to shape latent-space geometry,
  • a reconstruction objective (and, potentially, adversarial components) to maintain perceptual quality.
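
To make the layout concrete, here is a minimal structural sketch in PyTorch. It is illustrative only: the names (SpeechAE, latent_dim, and so on) are my own placeholders, a plain Transformer encoder stands in for the Zipformer-style backbone rather than reproducing it, and the low-dimensional projection is just the simplest form of a continuous bottleneck.

```python
# Minimal sketch of the encoder -> bottleneck -> decoder layout (assumed names; not the final model).
import torch
import torch.nn as nn

class SpeechAE(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256, latent_dim: int = 64):
        super().__init__()
        # Stand-in for a Zipformer-style encoder backbone [1]; a plain Transformer
        # encoder is used here purely to keep the sketch self-contained.
        self.frontend = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Continuous bottleneck: no quantization, just a low-dimensional projection.
        self.to_latent = nn.Linear(d_model, latent_dim)
        self.from_latent = nn.Linear(latent_dim, d_model)
        # Lightweight decoder that reconstructs the input features.
        self.decoder = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, n_mels)
        )

    def forward(self, feats: torch.Tensor):
        # feats: (batch, time, n_mels) log-mel features.
        h = self.encoder(self.frontend(feats))
        z = self.to_latent(h)                 # continuous latents for the semantic objective
        recon = self.decoder(self.from_latent(z))
        return z, recon
```

The point of the sketch is the separation of concerns: the latents z feed the semantic objective, while the decoder output feeds the reconstruction (and any adversarial) objective.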

Key hypothesis (SigReg vs. KL)

One of the hypotheses I am actively testing is that geometric regularization of the latent distribution (SigReg) can serve as a more effective constraint for continuous latents than a conventional KL divergence term, particularly when the model is trained in an autoencoder-style framework.
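
To make the comparison concrete, the snippet below shows the two regularizers side by side as I currently prototype them. The KL term is the standard closed-form VAE penalty against a unit Gaussian; the geometric term is only one simple instantiation (matching the batch of latents to zero mean and identity covariance) and is not a claim about the final form of the regularizer.

```python
# Two candidate latent regularizers for the continuous-latent autoencoder (illustrative forms only).
import torch

def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Closed-form KL(q(z|x) || N(0, I)), averaged over the batch (standard VAE term).
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()

def geometric_reg(z: torch.Tensor) -> torch.Tensor:
    # One simple geometric constraint: push the batch of latents toward zero mean
    # and identity covariance (an isotropic, Gaussian-like shape) without requiring
    # a stochastic encoder or a per-sample variance head.
    z = z.reshape(-1, z.shape[-1])                     # (N, d), flattening time if present
    mean = z.mean(dim=0)
    zc = z - mean
    cov = zc.t() @ zc / max(z.shape[0] - 1, 1)
    eye = torch.eye(z.shape[-1], device=z.device)
    return mean.pow(2).sum() + (cov - eye).pow(2).sum()
```

A practical difference is that the KL term requires a stochastic encoder emitting (mu, logvar) per frame, whereas a batch-level geometric penalty can be applied directly to deterministic latents, which is part of why I expect it to be the more stable choice in an autoencoder-style setup.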

This direction is motivated by recent work arguing that unified autoencoding can reconcile semantic and low-level representations (the “Prism Hypothesis”), which aligns with the intuition that reconstruction and semantics can be modeled within a single framework rather than treated as competing objectives. [5]

I am also using representation-autoencoder-style ideas as implementation guidance around training dynamics and latent usage. [14]

Evaluation (ongoing)

I am benchmarking against strong existing systems (including UniWav, wav2vec 2.0, HuBERT, and other recent SSL frameworks) and tracking the following, with a sketch of one spectral-distance metric after the list: [3][2][10][9][6]

  • reconstruction-oriented metrics (e.g., spectral distances; perceptual scores when appropriate),
  • downstream discrimination tasks that reflect semantic usefulness,
  • resource sensitivity (training stability, data efficiency, and compute requirements), with a focus on lower-resource settings such as Bengali.
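
For the reconstruction-oriented metrics, the snippet below shows the kind of multi-resolution log-spectral distance I have in mind; the FFT sizes, hop lengths, and equal weighting are placeholders rather than a fixed evaluation protocol.

```python
# Illustrative multi-resolution log-spectral distance between reference and reconstructed audio.
import torch

def log_spectral_distance(ref: torch.Tensor, rec: torch.Tensor,
                          fft_sizes=(512, 1024, 2048), eps: float = 1e-7) -> torch.Tensor:
    # ref, rec: (batch, samples) waveforms at the same sample rate.
    total = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=ref.device)
        spec_ref = torch.stft(ref, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True).abs()
        spec_rec = torch.stft(rec, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True).abs()
        # L1 distance between log-magnitude spectrograms at this resolution.
        total = total + (torch.log(spec_ref + eps) - torch.log(spec_rec + eps)).abs().mean()
    return total / len(fft_sizes)
```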

References

[1] Z. Yao et al., “Zipformer: A Faster and Better Encoder for Automatic Speech Recognition,” 2023, arXiv: arXiv:2310.11230. doi: 10.48550/arXiv.2310.11230.

[2] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” Oct. 22, 2020, arXiv: arXiv:2006.11477. doi: 10.48550/arXiv.2006.11477.

[3] A. H. Liu et al., “UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation,” Mar. 02, 2025, arXiv: arXiv:2503.00733. doi: 10.48550/arXiv.2503.00733.

[4] Y. Yang, F. Yao, M. Liao, and Y. Huang, “Ubranch Conformer: Integrating Up-Down Sampling and Branch Attention for Speech Recognition,” IEEE Access, vol. 12, pp. 144684–144697, 2024, doi: 10.1109/ACCESS.2024.3471183.

[5] W. Fan, H. Diao, Q. Wang, D. Lin, and Z. Liu, “The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding,” Dec. 22, 2025, arXiv: arXiv:2512.19693. doi: 10.48550/arXiv.2512.19693.

[6] X. Yang et al., “SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations,” Oct. 29, 2025, arXiv: arXiv:2510.25955. doi: 10.48550/arXiv.2510.25955.

[7] S. Bharadwaj et al., “OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder,” Jul. 18, 2025, arXiv: arXiv:2507.14129. doi: 10.48550/arXiv.2507.14129.

[8] A. Défossez et al., “Moshi: a speech-text foundation model for real-time dialogue,” Oct. 02, 2024, arXiv: arXiv:2410.00037. doi: 10.48550/arXiv.2410.00037.

[9] Y. Yang et al., “k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning,” Nov. 26, 2024, arXiv: arXiv:2411.17100. doi: 10.48550/arXiv.2411.17100.

[10] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” Jun. 14, 2021, arXiv: arXiv:2106.07447. doi: 10.48550/arXiv.2106.07447.

[11] A. Défossez et al., “High Fidelity Neural Audio Compression,” Oct. 24, 2022, arXiv: arXiv:2210.13438. doi: 10.48550/arXiv.2210.13438.

[12] D. Rekesh et al., “Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition,” Sep. 30, 2023, arXiv: arXiv:2305.05084. doi: 10.48550/arXiv.2305.05084.

[13] Z. Ma et al., “emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation,” Dec. 23, 2023, arXiv: arXiv:2312.15185. doi: 10.48550/arXiv.2312.15185.

[14] B. Zheng, N. Ma, S. Tong, and S. Xie, “Diffusion Transformers with Representation Autoencoders,” Oct. 13, 2025, arXiv: arXiv:2510.11690. doi: 10.48550/arXiv.2510.11690.

[15] H. Huang et al., “NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks,” Aug. 23, 2024, arXiv: arXiv:2408.13106. doi: 10.48550/arXiv.2408.13106.

[16] S. Rouard, M. Orsini, A. Roebel, N. Zeghidour, and A. Défossez, “Continuous Audio Language Models,” Sep. 09, 2025, arXiv: arXiv:2509.06926. doi: 10.48550/arXiv.2509.06926.
