SSE: Stable Static Embedding

Static embedding models enable fast inference thanks to their simple architecture, but improving their structural expressiveness is well known to be difficult. At the same time, as corpora continue to grow in scale, the demand for both higher efficiency and higher accuracy in embedding models has increased significantly. In this work, we propose a simple yet effective method called SSE (Stable Static Embedding), which incorporates Separable DyT (Dynamic Tanh normalization). We demonstrate that SSE achieves higher retrieval performance than prior approaches while using only half the number of parameters. Despite having only 16 million parameters, SSE attains a mean NanoBEIR (English) nDCG@10 score of 0.512. By leveraging Separable DyT, SSE regulates gradient flow and suppresses inter-dimensional imbalance and overfitting, thereby improving generalization performance. Our method provides a new perspective on static embedding models and offers a pathway toward faster and more accurate retrieval systems.

Rikka Botan

April 03, 2026

Transcript

  1. SSE (Stable Static Embedding): Unlocking the Potential of Static Embeddings, A Dynamic Tanh Normalization Approach without Speed Penalty
  2. About us

    Rikka Botan
    Independent researcher (machine learning / algebra / mathematical logic)
    ◆Hobby: Making sweets, Tea, Listening to classical music, Clothes
    ◆Recent Activities: Silver Award: Liquid AI Hackathon Series | Tokyo; Article writing (related to Mamba, LFM2 (LTCs))
    X (Twitter), Portfolio
  3. Firstly

    The author acknowledges the support of Saldra, Witness, and Lumina Logic Minds for providing the computational resources used in this work.
  4. Introduction

    ◆The Importance of Fast Search
    RAG (Retrieval-Augmented Generation), recommendation systems, internal document searches
    In these systems, it is necessary to quickly retrieve relevant information from millions to tens of billions of documents. Balancing response speed and search accuracy greatly impacts the user experience. Many systems use a configuration called Retrieval + Reranking.

    ◆Related Studies

    | Year | Paper / Model | Author | Feature |
    |------|---------------|--------|---------|
    | 2013 | Word2Vec | Tomas Mikolov et al. | A method for embedding words into low-dimensional vectors using Skip-gram / CBOW. One of the earliest static embeddings using word co-occurrence. |
    | 2014 | GloVe | Jeffrey Pennington et al. | Static embeddings using statistical information from word co-occurrence matrices. A representative method alongside Word2Vec. |
    | 2019 | Sentence-BERT | Nils Reimers et al. | Generates sentence embeddings using a Siamese BERT architecture (learns similarity between sentence vectors). High quality but computationally expensive. |
    | 2024 | Model2Vec | MinishLab | A method for distilling Sentence Transformers to create a compact static embedding model. |
    | 2025 | Static Retrieval MRL | Tom Aarsen | Fast static sentence embedding with averaged token embeddings. Matryoshka loss and contrastive learning. 100–400× faster on CPU. |
  5. Research perspective

    Question: The architecture of the static embedding model has not changed since Word2Vec. Only the learning methods have been improved.
    Challenge: The architecture is simple, making it difficult to implement improvements without sacrificing speed (and adopting highly expressive operations is also challenging).
    Idea: Control of the representation space through interaction with the learning mechanism and process, rather than through individual modules.
  6. Method

    ◆Introduction of Separable DyT (Dynamic Tanh normalization) and Construction of SSE (Stable Static Embedding)
    ▪Architecture
    ▪Algorithm

    $$y_k = c_k \tanh(a_k x_k + b_k)$$

    $$\frac{\partial y_k}{\partial x_k} = c_k a_k \,\mathrm{sech}^2(a_k x_k + b_k)$$

    ◆Gradient Control and Improved Generalization of Representation Space by Separable DyT
    Decay of saturated dimensions: $\partial y_k / \partial x_k \to 0$ when $|a_k x_k + b_k| \gg 1$. Learning signals with high noise are attenuated.
    Maintenance of unsaturated dimensions: $\partial y_k / \partial x_k \approx c_k a_k$ when $|a_k x_k + b_k| < 1$. Learning signals with stable information are maintained.
    Without explicit hyperparameters, implicit regularization enhances the generalization performance of the representation space.
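The per-dimension formula on this slide can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `separable_dyt` and `separable_dyt_grad`, the parameter values (a = 1, b = 0, c = 1), and the sample inputs are all hypothetical. It only shows the forward map y_k = c_k tanh(a_k x_k + b_k) and how its derivative stays near c_k a_k for small pre-activations while vanishing for saturated ones.

```python
import numpy as np

def separable_dyt(x, a, b, c):
    """Forward pass: y_k = c_k * tanh(a_k * x_k + b_k), one (a, b, c) triple per dimension."""
    return c * np.tanh(a * x + b)

def separable_dyt_grad(x, a, b, c):
    """Analytic derivative: dy_k/dx_k = c_k * a_k * sech^2(a_k * x_k + b_k)."""
    z = a * x + b
    return c * a * (1.0 - np.tanh(z) ** 2)  # sech^2(z) = 1 - tanh^2(z)

# Hypothetical parameter values, chosen only for illustration.
dim = 4
a = np.ones(dim)
b = np.zeros(dim)
c = np.ones(dim)

# Two unsaturated inputs (0.0, 0.1) and two saturated inputs (5.0, -20.0).
x = np.array([0.0, 0.1, 5.0, -20.0])
y = separable_dyt(x, a, b, c)
g = separable_dyt_grad(x, a, b, c)
```

With these values, the first two entries of `g` stay close to c_k a_k = 1, while the last two are close to zero, matching the "maintenance" and "decay" regimes described on the slide.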
  7. Evaluations

    Comparison of (a) Loss and (b) Gradient Norm Across Training Steps.
    ➢ SSE maintains its gradient even in the later stages of training and continues to update parameters.
  8. Evaluations

    NanoBEIR mean nDCG@10 Across Training Steps.
    NanoBEIR English mean nDCG@10 vs Matryoshka Embedding Truncation.
    ➢ SSE consistently surpasses the baseline in the latter half of training and at large embedding dimensions.
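For readers unfamiliar with Matryoshka truncation, the following is a minimal sketch of what evaluating at a smaller embedding dimension usually involves: keep the leading coordinates, then re-normalize. The helper name `truncate_embeddings`, the 512 and 128 dimensions, and the random vectors are assumptions for illustration only and do not come from the slides.

```python
import numpy as np

def truncate_embeddings(emb, dim):
    """Keep the first `dim` coordinates of each row and re-normalize to unit length."""
    cut = emb[:, :dim]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)  # guard against all-zero rows

# Random stand-ins for real sentence embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
full = rng.normal(size=(5, 512))
small = truncate_embeddings(full, 128)
```

Re-normalizing after truncation keeps dot products interpretable as cosine similarities at every truncation size.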
  9. Evaluations

    (a) Retrieval performance (nDCG@10) across NanoBEIR English tasks.
    (b) Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on TREC-COVID and Quora using an Intel® Core Ultra 7 265K (3.90 GHz) with batch size 32.
    ➢ In the English document retrieval task, SSE reaches the frontier of speed and accuracy.
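Since every evaluation slide reports nDCG@10, a small sketch of the metric may help. It assumes linear gain and that the ranked list contains every judged document for the query, so the ideal ranking can be obtained by sorting the same list; benchmark tooling may differ in these details. The function names `dcg_at_k` and `ndcg_at_k` are illustrative.

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..k map to log2(2..k+1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k, assuming the list holds all judged documents for the query."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

perfect = ndcg_at_k([1, 0, 0])   # relevant document ranked first
second = ndcg_at_k([0, 1])       # relevant document ranked second
```

A mean nDCG@10 such as the one quoted in the abstract would then be an average of this per-query score over queries and datasets.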
  10. Evaluations

    (a) Retrieval performance (nDCG@10) across NanoBEIR Japanese tasks.
    (b) Mean nDCG@10 vs. inference speed (QPS: queries per second) measured on MIRACL using an Intel® Core Ultra 7 265K (3.90 GHz) with batch size 32.
    ➢ In the Japanese document retrieval task, SSE also reaches the frontier of speed and accuracy.
  11. Evaluations

    PCA Spectrum on the 13 NanoBEIR English Datasets: Normalized Eigenvalue Decay (a) Linear Scale, (b) Logarithmic Scale.
    In SSE, the decay of eigenvalues was observed at smaller dimensional sizes.
    ➢ By suppressing noise, low-rank regularization (concentration of information into a compact subspace) is implicitly achieved.
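A PCA spectrum of this kind can be computed from a matrix of embeddings as sketched below. The sketch assumes that "normalized" means dividing each covariance eigenvalue by the sum of all eigenvalues; the slides do not state the normalization, so this is a guess. The function name `normalized_pca_spectrum` and the synthetic rank-3 data are hypothetical.

```python
import numpy as np

def normalized_pca_spectrum(emb):
    """Covariance eigenvalues of `emb` (rows = samples), sorted descending, summing to 1."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data, divided by (n - 1),
    # are the eigenvalues of the sample covariance matrix.
    s = np.linalg.svd(centered, compute_uv=False)
    eig = s ** 2 / (emb.shape[0] - 1)
    return eig / eig.sum()

# Synthetic embeddings that live in a 3-dimensional subspace of a 16-dimensional space.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 16))
spectrum = normalized_pca_spectrum(low_rank)
```

For the synthetic data, essentially all of the variance falls in the first three components, which is the kind of early eigenvalue decay the slide describes as concentration into a compact subspace.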
  12. Application

    It is possible to score tens of thousands of documents within one second using only a general-purpose CPU. Combined with web search (such as DuckDuckGo), lightweight reference searches can also be realized.
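Scoring a precomputed document matrix against a query reduces to one matrix-vector product followed by a top-k selection, as in the sketch below. The corpus here is random unit vectors standing in for real static embeddings, and the helper name `top_k`, the corpus size, and the dimension are assumptions for illustration. No timing is claimed for this code.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k=10):
    """Return indices and scores of the k highest dot-product matches (k >= 1)."""
    scores = doc_matrix @ query_vec           # one score per document
    k = min(k, scores.size)
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    idx = idx[np.argsort(-scores[idx])]        # order them by descending score
    return idx, scores[idx]

# Hypothetical corpus: 20,000 random unit vectors of dimension 128.
rng = np.random.default_rng(0)
docs = rng.normal(size=(20_000, 128)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = docs[123]                # reuse a document vector as the query
best_idx, best_scores = top_k(query, docs, k=5)
```

Because the vectors are unit-normalized, the dot product equals cosine similarity, and the document used as the query is returned first.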