The team at Perplexity is pleased to announce the release of pplx-embed-v1 and pplx-embed-context-v1, two new text embedding models designed for efficient retrieval on a massive scale. These models are optimized to handle the complexities of web-based information access.
Key Features and Models:
These models come in two sizes: 0.6 billion and 4 billion parameters. The 0.6B models target low-latency embedding generation, making them ideal for resource-constrained environments, while the 4B models prioritize retrieval quality.
- pplx-embed-v1: This version is optimized for standard dense text retrieval tasks.
- pplx-embed-context-v1: This model accounts for surrounding document-level context when generating embeddings, which can be particularly helpful for nuanced searches.
Technical Details:
- Quantization: Both models utilize quantization techniques to significantly reduce storage requirements: 4x compression with INT8 and up to 32x with binary formats, relative to float32. This makes storing and retrieving embeddings at scale much more practical.
- Training: The models undergo a multi-stage training pipeline, starting with Qwen3 base models and employing diffusion-based continued pretraining to create bidirectional encoders. This process enhances semantic understanding and improves retrieval performance.
- No Instruction Prefixes: Unlike many modern embedding models, these models don’t require instruction prefixes, simplifying integration and reducing potential inconsistencies.
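The storage arithmetic behind those compression factors can be sketched with a generic post-hoc quantization scheme (an illustration of the idea, not the models' exact quantization method; the 1024-dim vector is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1024  # hypothetical embedding dimension
emb = rng.standard_normal(dim).astype(np.float32)  # 4 bytes per dimension

# INT8: scale each vector into [-127, 127] and round -> 1 byte per dimension (4x smaller)
scale = np.abs(emb).max() / 127.0
emb_int8 = np.round(emb / scale).astype(np.int8)

# Binary: keep only the sign of each dimension, packed 8 dims per byte (32x smaller)
emb_bin = np.packbits(emb > 0)

print(emb.nbytes, emb_int8.nbytes, emb_bin.nbytes)  # 4096 1024 128
```

At scale the savings compound: a 30M-document corpus of 1024-dim float32 vectors needs roughly 123 GB, while the binary form fits in under 4 GB.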
Benchmark Results:
The models have shown impressive results on several benchmarks, including MTEB, BERGEN, ToolRet, and ConTEB. Specifically:
- MTEB (Multilingual, v2): The 4B INT8 model achieved an average nDCG@10 of 69.66%, matching Qwen3-Embedding-4B and exceeding gemini-embedding-001.
- ConTEB: The pplx-embed-context-v1-4B model attained the highest score of 81.96% on this benchmark.
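For reference, nDCG@10 (the metric reported above) rewards relevant results that appear early in the ranking, discounting each result by the log of its rank. A minimal sketch with hypothetical relevance grades:

```python
import math

def dcg_at_k(rels, k=10):
    # Graded relevance discounted by log2 of the 1-indexed rank (+1)
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    # Normalize by the DCG of the ideal (best possible) ordering
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical ranking: the single relevant document retrieved at rank 2
print(ndcg_at_k([0, 1, 0, 0], k=10))  # 1/log2(3) ~= 0.631
```

A perfect ranking scores 1.0; burying relevant documents lower in the top 10 pulls the score toward 0.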
Real-World Performance:
Internal benchmarks using PPLXQuery2Query and PPLXQuery2Doc demonstrate the models’ effectiveness:
- PPLXQuery2Query (2.4M corpus): The 4B INT8 model achieved 73.5% Recall@10, outperforming Qwen3-Embedding-4B.
- PPLXQuery2Doc (30M corpus): The 4B model delivered a recall of 91.7% at 1000 retrieved documents.
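Recall@k, the metric used in both internal benchmarks, is simply the fraction of relevant documents that appear among the top k retrieved results. A minimal sketch with hypothetical document IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs found in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Hypothetical example: 2 of 3 relevant docs retrieved within the top 10
retrieved = ["d7", "d2", "d9", "d1", "d4", "d5", "d8", "d3", "d6", "d0"]
relevant = ["d2", "d4", "d99"]
print(recall_at_k(retrieved, relevant, k=10))  # 2/3
```

Note the trade-off the two numbers above illustrate: at k=10 recall is tight and hard to maximize, while at k=1000 (a typical first-stage retrieval budget before reranking) recall above 90% is the goal.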
Getting Started:
These models are available on Hugging Face (under the MIT License) and through the Perplexity API. Support is provided via Transformers, SentenceTransformers, Text Embeddings Inference, and ONNX.
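Loading and querying should follow the standard SentenceTransformers pattern. A sketch under assumptions: the repository id below is illustrative, so check the Perplexity Hugging Face collection for the exact model names before running it.

```python
from sentence_transformers import SentenceTransformer

# Repository id is an assumption for illustration; see the Perplexity
# Hugging Face collection for the actual model names.
model = SentenceTransformer("perplexity-ai/pplx-embed-v1-0.6b")

queries = ["how do binary quantized embeddings work?"]
docs = ["Binary quantization stores only the sign of each embedding dimension."]

# No instruction prefixes needed for either queries or documents
q_emb = model.encode(queries)
d_emb = model.encode(docs)

scores = model.similarity(q_emb, d_emb)  # cosine similarity by default
print(scores)
```

Because these models take no instruction prefixes, the same `encode` call works for queries and documents alike, which removes one common source of integration bugs.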
For further details, we encourage you to consult the technical report, Hugging Face model collection, and API documentation.

Gladstone is a technology veteran with a 25-year journey through the digital landscape. He has engineered cutting-edge software, led high-performing teams, and designed robust system architectures, with experience spanning large-scale systems as well as embedded systems and microcontrollers. He holds an honours degree in a computer-science-related field from a prestigious British institution.
