Semantic Search and Embeddings: Modernizing PHP Search with Vector Databases

The Hook

Query: "how to deal with stress before bed"

MySQL FULLTEXT: returns nothing.
Postgres tsvector: returns nothing.

Twelve articles on anxiety, sleep hygiene, and evening relaxation exist in the database. The content is there. The search is the problem, because none of them contain that exact phrase.

What This Talk Is

A Working Tour: Semantic search in a real Laravel app, not a toy demo.
Embeddings Without Math: Intuition-first explanations. No ML background needed.
Real Storage: Vector storage in Postgres, and where else it lives.
Concrete Code: Real costs. Real pitfalls. Ship this on Monday.

SECTION 2: THE CASE STUDY

Meet DailyMedToday

A Laravel app publishing a curated meditation every day, a searchable archive of past sessions, and a space for user notes and reflections. Real production traffic. Real search problems.

The Stack

Web Layer: Laravel + PHP-FPM behind Nginx
Cache: Redis for sessions and caching
Primary DB: Postgres with the pgvector extension
Embeddings: Voyage AI over an HTTP REST API
Orchestration: Kubernetes managed via Displace CLI

Key Decisions

pgvector on Postgres: No second database to operate or back up.
Voyage AI over HTTP: Simple REST API, no SDK sprawl.
Displace CLI on K8s: The deployment wrapper from yesterday's talk.

What People Search For

Two kinds of content. Two kinds of search problem.

Editorial Content: Admin-curated meditations with titles, summaries, tags. Public. Needs to surface for every user query.
User-Generated Content: Private journal entries, reflection notes, search queries themselves. Needs to be findable, but only by the user who wrote it.

Both content types need semantic search. Only one is public. Your access-control layer must account for this distinction.

Where Keyword Search Fails

Three real examples from DailyMedToday's search logs:

1. Synonyms: "anxious" vs "worried" vs "on edge". The content uses all three. Keyword search treats them as different strings.
2. Intent: "can't sleep" vs "insomnia tips". Same intent, zero lexical overlap. Lexical search returns nothing.
3. Typos: "medittation". Even with fuzzy matching, you're fighting character distance, not meaning.
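To make the gap between character distance and meaning concrete, here is a minimal sketch using PHP's built-in levenshtein(). The helper name editDistance is hypothetical, and the word pairs are the ones from the search-log examples above.

```php
<?php
declare(strict_types=1);

// Character-level edit distance, case-insensitive (ASCII lowercasing only).
function editDistance(string $a, string $b): int
{
    return levenshtein(strtolower($a), strtolower($b));
}

$typo    = editDistance('medittation', 'meditation'); // one stray "t"
$synonym = editDistance('anxious', 'worried');        // same meaning, different letters

// Fuzzy matching accepts small distances, so it can rescue the typo,
// but the synonym pair still looks like two unrelated strings.
printf("typo distance: %d, synonym distance: %d\n", $typo, $synonym);
```

A fuzzy-match threshold loose enough to link "anxious" and "worried" would also link many unrelated words, which is why edit distance alone does not cover the synonym and intent cases.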

SECTION 3: EMBEDDINGS, INTUITIVELY

Meaning over Matching

Keyword Search asks: "Do these strings appear together?"
Semantic Search asks: "Do these mean the same thing?"

That's the entire conceptual leap. Everything that follows is implementation detail.

What an Embedding Actually Is

A list of numbers: Maybe 256 of them. Maybe 1024. Maybe 3072. Each number captures something about the meaning of a piece of text.
Opaque by design: You don't get to know what dimension 417 means individually. You don't need to. The model handles that.
The whole point: Texts with similar meaning produce vectors that point in similar directions. That's the property we exploit.

Vectors as Coordinates

[Figure: 2D sketch showing a Stress cluster, a Calm cluster, and separate points labeled "Eiffel Tower" and "Lasagna"]

The Core Intuition

Similar meanings cluster in space. "Anxiety" and "worry" land near each other. "Lasagna" lands somewhere else entirely.
This 2D sketch extends into 1024 dimensions. The geometry is the same; we just can't draw it.
Clustering is emergent: the model learned it from enormous text corpora. You inherit that geometry for free.

Cosine Similarity in One Slide

 1.0   Identical direction: essentially the same meaning.
~0.85  High similarity: synonyms, paraphrases.
~0.5   Related but distinct: same domain, different focus.
 0.0   Unrelated: "meditation" and "lasagna."
-1.0   Opposite direction: rare in practice.

The Postgres <=> operator computes cosine distance (1 - similarity). That's the only operator you need to know.
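For readers who want to see the arithmetic behind those scores, the following is a small sketch of cosine similarity and cosine distance in plain PHP. The function names are illustrative, and the two-dimensional vectors are toy values rather than real embeddings.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity: dot(a, b) / (|a| * |b|).
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA == 0.0 || $normB == 0.0) {
        throw new InvalidArgumentException('Cosine similarity is undefined for a zero vector.');
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/** Cosine distance, the quantity pgvector's <=> operator returns. */
function cosineDistance(array $a, array $b): float
{
    return 1.0 - cosineSimilarity($a, $b);
}

// Toy 2-D vectors: same direction, perpendicular, opposite.
printf("%.1f\n", cosineSimilarity([1.0, 0.0], [2.0, 0.0]));  // 1.0
printf("%.1f\n", cosineSimilarity([1.0, 0.0], [0.0, 1.0]));  // 0.0
printf("%.1f\n", cosineSimilarity([1.0, 0.0], [-1.0, 0.0])); // -1.0
```

Only direction matters here: scaling a vector by any positive factor leaves the score unchanged, which is why the first pair scores 1.0 despite different lengths.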

Where the Meaning Comes From

You're renting learned geometry

A pre-trained model has already ingested enormous text corpora and learned that "anxious" and "worried" point in similar directions in vector space. You are not training anything. You're calling an API and getting numbers back. This is what makes embeddings cheap, fast, and accessible to any PHP team today.

SECTION 4: CHOOSING THE PIECES

Two Decisions, Two Services

Which embedding model? It produces the vectors and determines quality, dimensionality, and cost per token.
Where do vectors live? Your vector store. It determines query latency, operational complexity, and scale ceiling.

These decisions are independent. You can swap either side later, at a migration cost. Choose deliberately, but don't overthink it up front.

The Model: Why Voyage AI

Benchmark Quality: Strong performance on retrieval benchmarks, especially for short-form content like the meditation summaries in DailyMedToday.
Low Latency: Consistent sub-100ms response on typical payloads. Sensible pricing. A clean HTTP API, no SDK required.
Valid Alternatives: OpenAI, Cohere, or self-hosted (BGE, E5, nomic-embed-text via Ollama). Pick on quality, latency, cost, and data-residency requirements.

Voyage AI at a Glance

The kind of comparison table you'd build when evaluating any provider:

Model            Dimensions   Max Tokens   $/1M tokens   Latency
voyage-4         256-2048     32,000       ~$0.06        <100ms
voyage-4-lite    256-2048     32,000       ~$0.02        <60ms
voyage-4-large   256-2048     32,000       ~$0.12        <150ms

voyage-4-lite is the cost-sensitive default for DailyMedToday's user-generated content. Editorial content uses voyage-4 for higher recall quality.
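A quick way to turn the table into a budget is to multiply token volume by the per-million price. The sketch below copies the approximate prices from the table; the function name and the archive size (10,000 documents averaging 500 tokens) are hypothetical numbers chosen only for illustration.

```php
<?php
declare(strict_types=1);

// Approximate prices in USD per 1M tokens, copied from the comparison table.
const PRICE_PER_MILLION_TOKENS = [
    'voyage-4-lite'  => 0.02,
    'voyage-4'       => 0.06,
    'voyage-4-large' => 0.12,
];

function estimateEmbeddingCostUsd(int $totalTokens, string $model): float
{
    if (!isset(PRICE_PER_MILLION_TOKENS[$model])) {
        throw new InvalidArgumentException("Unknown model: {$model}");
    }

    return $totalTokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[$model];
}

// Hypothetical archive: 10,000 documents averaging 500 tokens each.
$tokens = 10_000 * 500;

foreach (array_keys(PRICE_PER_MILLION_TOKENS) as $model) {
    printf("%-15s $%.2f\n", $model, estimateEmbeddingCostUsd($tokens, $model));
}
```

At that assumed volume a full re-embed costs well under a dollar on any of the three models, so the cost that tends to matter is repeated embedding of unchanged content, not the initial backfill.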

Storage: pgvector

    -- One command to enable:
    CREATE EXTENSION IF NOT EXISTS vector;

    -- Add a vector column:
    ALTER TABLE meditations
        ADD COLUMN embedding vector(1024);

    -- Done. No new service.
    -- No second backup job.
    -- No second connection pool.

Why Postgres for DailyMedToday

We already run Postgres for everything else. One CREATE EXTENSION and we're done: no new operational surface to learn, monitor, or scale. For a Laravel-shaped team, operational simplicity matters more than raw vector throughput at our scale.
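On the PHP side, a vector usually travels to Postgres as text in pgvector's bracketed format, for example '[0.1,0.2,0.3]'. The helper below is a minimal sketch of that conversion with a dimension check; the name toVectorLiteral is hypothetical and not part of any library.

```php
<?php
declare(strict_types=1);

/**
 * Format a PHP array of numbers as a pgvector text literal such as "[0.1,-2,3.5]".
 *
 * @param list<int|float> $values
 */
function toVectorLiteral(array $values, int $expectedDimensions): string
{
    if (count($values) !== $expectedDimensions) {
        throw new InvalidArgumentException(sprintf(
            'Expected %d dimensions, got %d.',
            $expectedDimensions,
            count($values)
        ));
    }

    $parts = [];
    foreach ($values as $value) {
        if ((!is_int($value) && !is_float($value)) || !is_finite((float) $value)) {
            throw new InvalidArgumentException('Vector components must be finite numbers.');
        }
        $parts[] = (string) (float) $value;
    }

    return '[' . implode(',', $parts) . ']';
}

echo toVectorLiteral([0.1, -2.0, 3.5], 3), "\n"; // [0.1,-2,3.5]
```

The resulting string can be bound as an ordinary query parameter, typically with an explicit ::vector cast in the SQL. Checking the dimension in PHP gives a clearer error than letting Postgres reject a vector that does not match vector(1024).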

Storage Alternatives

The Decision Rule: Use the database you already operate until it stops being enough.

MySQL shop? MySQL 9 has a native VECTOR type, no extension needed.
Already on ES? The dense_vector field works today.
Scale or advanced ANN? Graduate to Qdrant or Weaviate.

SECTION 5: THE PIPELINE

The Shape of the System

Two-Path Embedding Pipeline

Write path: Text → Voyage API → Embedding generation (Voyage returns 1024 floats) → Storage (Postgres row with vector(1024))
Query path: User query → Voyage API → Query embedding (Voyage returns embedding) → Nearest-neighbor search (SQL ORDER BY cosine LIMIT 10)

The Schema

    CREATE TABLE meditations (
        id        BIGSERIAL PRIMARY KEY,
        title     TEXT NOT NULL,
        body      TEXT NOT NULL,
        embedding vector(1024)
    );

    -- IVFFlat index for approximate nearest-neighbor
    -- (HNSW is faster at query time but slower to build)
    CREATE INDEX meditations_embedding_idx
        ON meditations
        USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100);

Choose ivfflat to start: it is cheaper to build. Migrate to hnsw when query latency becomes a priority. Both use vector_cosine_ops.

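Embedding on Write

    // app/Observers/MeditationObserver.php
    class MeditationObserver
    {
        // wasChanged() reports nothing right after an insert,
        // so new rows get their own hook.
        public function created(Meditation $model): void
        {
            EmbedMeditation::dispatch($model);
        }

        public function updated(Meditation $model): void
        {
            // Skip updates that did not touch the embedded text,
            // including the job's own write to the embedding column.
            if ($model->wasChanged(['title', 'body'])) {
                EmbedMeditation::dispatch($model);
            }
        }
    }

    // app/Jobs/EmbedMeditation.php
    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;
    use Illuminate\Foundation\Bus\Dispatchable;
    use Illuminate\Queue\SerializesModels;

    class EmbedMeditation implements ShouldQueue
    {
        use Dispatchable, Queueable, SerializesModels;

        public function __construct(public Meditation $meditation) {}

        public function handle(VoyageClient $voyage): void
        {
            $vector = $voyage->embed(
                $this->meditation->title . "\n\n" . $this->meditation->body
            );

            $this->meditation->update([
                'embedding' => $vector,
            ]);
        }
    }

Key Points

Idempotency: Only re-embed when title or body actually changes. Re-embedding on every save burns money.
Queued Job: The Voyage call runs asynchronously and doesn't block the HTTP response.
Text Prep: Concatenate title + body. For long documents, chunk first.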
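The "chunk first" advice for long documents can be sketched as a simple word-packing splitter. This is one possible approach, not the talk's implementation: the function name chunkText and the 2,000-byte default are assumptions, and byte length is only a rough proxy for the model's token limit.

```php
<?php
declare(strict_types=1);

/**
 * Split text into chunks of at most $maxBytes, breaking only on whitespace.
 * A single word longer than $maxBytes is kept whole as its own chunk.
 *
 * @return list<string>
 */
function chunkText(string $text, int $maxBytes = 2000): array
{
    if ($maxBytes < 1) {
        throw new InvalidArgumentException('maxBytes must be at least 1.');
    }

    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $chunks = [];
    $current = '';
    foreach ($words as $word) {
        $candidate = $current === '' ? $word : $current . ' ' . $word;
        if ($current !== '' && strlen($candidate) > $maxBytes) {
            // Adding this word would overflow: close the chunk and start a new one.
            $chunks[] = $current;
            $current = $word;
        } else {
            $current = $candidate;
        }
    }

    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

print_r(chunkText('one two three four', 9));
```

Each chunk would then be embedded and stored as its own row, which means a chunked document needs a child table keyed to the parent meditation instead of a single embedding column.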

Embedding the Query

    // app/Services/SearchService.php
    class SearchService
    {
        public function __construct(private VoyageClient $voyage) {}

        public function search(string $query): Collection
        {
            $cacheKey = 'embed:' . sha1($query);

            // Reuse the cached embedding for an hour; call Voyage only on a miss.
            $vector = Cache::remember(
                $cacheKey,
                now()->addHour(),
                fn() => $this->voyage->embed($query)
            );

            return $this->runVectorQuery($vector);
        }
    }

Cache Hot Queries

Common searches ("stress relief," "morning routine") repeat constantly. Cache their embeddings in Redis.

Cache Hit: ~0ms embed cost. Skip Voyage entirely.
Cache Miss: ~80ms Voyage call, then cached for 1 hour.
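One refinement worth considering: sha1($query) treats "Stress Relief" and " stress  relief" as different keys. The sketch below normalizes the query before hashing and folds the model name into the key so a model swap never serves stale vectors. The function name and key layout are assumptions for illustration.

```php
<?php
declare(strict_types=1);

// Build a cache key that ignores case and whitespace differences
// and is scoped to the embedding model that produced the vector.
function embeddingCacheKey(string $query, string $model): string
{
    // Collapse runs of whitespace, trim the ends, lowercase (ASCII only).
    $collapsed  = preg_replace('/\s+/', ' ', $query) ?? $query;
    $normalized = strtolower(trim($collapsed));

    return 'embed:' . $model . ':' . sha1($normalized);
}

var_dump(
    embeddingCacheKey('  Stress   Relief ', 'voyage-4-lite')
    === embeddingCacheKey('stress relief', 'voyage-4-lite')
); // bool(true)
```

The normalization is deliberately conservative. Anything that changes meaning, such as stripping punctuation or stemming, is left to the embedding model.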

The Query

    -- Cosine distance: smaller = more similar
    -- 1 - distance = similarity score
    -- (range -1 to 1; negative values are rare in practice)
    SELECT id,
           title,
           1 - (embedding <=> :query_vector) AS similarity
    FROM meditations
    WHERE embedding IS NOT NULL
    ORDER BY embedding <=> :query_vector
    LIMIT 10;

For the "stress before bed" query, this returns articles on anxiety management, sleep hygiene, evening breathing exercises, and body scan meditations. MySQL FULLTEXT returned zero. Same corpus. Different operator.

SECTION 6: BEYOND PLAIN SEMANTIC SEARCH

Hybrid Search

Hybrid Search Pipeline

Query → Full-Text (tsvector / FULLTEXT) → ranked list A
Query → Embeddings (vector embedding search) → ranked list B
Lists A and B → Reciprocal Rank Fusion (merge using the RRF formula) → Final Ranked Results (combined ranking balances exact and semantic matches)

Why Bother?

Pure semantic search is weak on exact-match needs: proper nouns, error codes, product SKUs, author names.

Reciprocal Rank Fusion

Score = 1/(k + rank_bm25) + 1/(k + rank_vector), where rank_bm25 is a document's position in the full-text list (A), rank_vector is its position in the vector list (B), and k is a smoothing constant (60 is the commonly used default). No tuning required. Works well out of the box.

Hybrid search is the production default at most mature search teams. Start semantic-only, add hybrid when precision complaints arrive.
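The fusion step is small enough to write out. Below is a minimal sketch of Reciprocal Rank Fusion over any number of ranked ID lists, assuming each list is ordered best-first and using k = 60; the function name and the sample IDs are illustrative.

```php
<?php
declare(strict_types=1);

/**
 * Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
 * where rank is the 1-based position in that list.
 *
 * @param list<list<int|string>> $rankedLists document IDs, best first
 * @return array<int|string, float> fused scores keyed by ID, highest first
 */
function reciprocalRankFusion(array $rankedLists, int $k = 60): array
{
    $scores = [];
    foreach ($rankedLists as $list) {
        foreach (array_values($list) as $index => $id) {
            $rank = $index + 1;
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $rank);
        }
    }

    arsort($scores); // highest fused score first, keys preserved

    return $scores;
}

$fullText = [3, 1, 2]; // list A: IDs ranked by the full-text query
$vector   = [1, 4, 3]; // list B: IDs ranked by cosine distance

print_r(array_keys(reciprocalRankFusion([$fullText, $vector]))); // 1, 3, 4, 2
```

Documents 1 and 3 appear in both lists, so they outrank documents 4 and 2, which appear in only one. A document missing from a list simply contributes nothing for that list, so the two queries do not need to return the same candidates.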

Content Similarity / Recommendations

    -- "More like this": one query
    -- Use the current article's vector
    -- instead of a user query vector
    SELECT id,
           title,
           1 - (embedding <=> :current_embedding) AS similarity
    FROM meditations
    WHERE id != :current_id
      AND embedding IS NOT NULL
    ORDER BY embedding <=> :current_embedding
    LIMIT 5;

Free Feature

You've already done the hard work. Swap the query vector for an article's own embedding and you have "Related Meditations" at zero additional API cost.

Recommendations: Surface similar articles at the end of each session.
Series / Playlists: Group semantically adjacent content automatically.

Deduplication and Clustering

Near-Duplicate Detection: User-generated content often repeats. Flag entries with cosine similarity > 0.97 as probable duplicates before storing. Same vector query, threshold applied in the WHERE clause.
Tag / Topic Drift: Over time, editors may apply the same tag to semantically distant content. Clustering reveals when a tag has lost coherence, which is useful for content audits.
Same Vectors, Different Queries: All of these features run against the same embedding column. The investment in embedding pays dividends across every content operation.
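The duplicate check can be illustrated in memory with toy vectors. This sketch assumes non-zero vectors of equal length and uses hypothetical function names. In SQL the same test would be expressed as a cosine distance below 0.03, since similarity > 0.97 is equivalent to 1 - similarity < 0.03.

```php
<?php
declare(strict_types=1);

// Cosine similarity for two non-zero vectors of equal length.
function vectorSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Return existing entries whose similarity to the candidate exceeds the threshold.
 *
 * @param list<float> $candidate
 * @param array<int|string, list<float>> $existing vectors keyed by entry ID
 * @return array<int|string, float> similarity keyed by ID, most similar first
 */
function findProbableDuplicates(array $candidate, array $existing, float $threshold = 0.97): array
{
    $matches = [];
    foreach ($existing as $id => $vector) {
        $score = vectorSimilarity($candidate, $vector);
        if ($score > $threshold) {
            $matches[$id] = $score;
        }
    }

    arsort($matches);

    return $matches;
}

// Toy 2-D data: "b" points almost the same way as the candidate, "a" does not.
$existing = ['a' => [0.0, 1.0], 'b' => [0.99, 0.05]];
print_r(array_keys(findProbableDuplicates([1.0, 0.0], $existing))); // b
```

The 0.97 cutoff comes from the slide. The right value depends on the model and the content, so it is worth checking against a handful of known duplicates before enforcing it.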

SECTION 7: TOPIC DISCOVERY

From 1024 Dimensions to a Screen

1024-D Embeddings → UMAP Projection → 2D Point Cloud → Human Inspection

Why Reduce Dimensions?

You can't look at 1024-dimensional space. UMAP, t-SNE, and PCA project vectors to 2D while trying to preserve neighborhood relationships.

Dimensionality reduction is for human inspection only. Never use the reduced coordinates to serve search results. The compression loses information.

The DailyMedToday Themes View

Each dot is a meditation. Nearby dots share semantically similar content. Color encodes auto-detected cluster. One UI, a wealth of editorial signal.

This is the "Theme Discovery" admin page: a UMAP projection of all meditation embeddings, rendered as an interactive scatter plot. Built with the same embedding column powering search.

What the Clusters Tell Us

1. Content Gaps: Large empty regions in the projection = topics the archive doesn't cover. Editorial priority backlog, generated automatically.
2. Accidental Over-Coverage: Dense clusters = one theme published too many times. Diversify before users notice the repetition.
3. Mislabeled Content: Meditations sitting in the wrong cluster despite their assigned tag. Visual outliers flag metadata errors instantly.
4. Series Candidates: Tight clusters of 4-7 sessions with a natural semantic arc, ready to package as a curated playlist.

SECTION 8: REALITY CHECK

Pitfalls

Model swaps invalidate all stored vectors: Version your embeddings. Store model_version alongside each vector. Migration = re-embed everything.
Dimension changes break indexes: Pick a dimension once. Changing it requires dropping and rebuilding the column and index.
Re-embedding on every edit: Debounce. Only embed when content actually changes. Naive observers will surprise you at month-end billing.
Vectors don't explain themselves: Log the source text that produced each vector. Debugging a bad search result is impossible without it.
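Two of these pitfalls (model swaps and needless re-embedding) can be guarded by one check before calling the API. The sketch below assumes each row stores the model version and a hash of the text that was embedded; the function name and the idea of a stored text hash are illustrative additions, while model_version is the column the slide recommends.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether a row's stored vector is stale.
 *
 * @param string|null $storedModelVersion model that produced the stored vector, null if never embedded
 * @param string|null $storedTextHash     sha256 of the text that was embedded, null if never embedded
 */
function needsReembedding(
    ?string $storedModelVersion,
    ?string $storedTextHash,
    string $currentModelVersion,
    string $currentText
): bool {
    // A different (or missing) model version means the vector lives in another space.
    if ($storedModelVersion !== $currentModelVersion) {
        return true;
    }

    // Same model: re-embed only if the source text actually changed.
    return $storedTextHash !== hash('sha256', $currentText);
}

$text = "Evening Wind-Down\n\nA short body scan before bed.";
$hash = hash('sha256', $text);

var_dump(needsReembedding('voyage-4', $hash, 'voyage-4', $text));      // bool(false)
var_dump(needsReembedding('voyage-4-lite', $hash, 'voyage-4', $text)); // bool(true)
```

Running this check inside the queued job, just before the API call, also covers the case where several edits are queued in quick succession and only the last one changes the text.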

When NOT to Bother

Tiny Corpus: Under ~500 documents, keyword search with a good synonym list is simpler, faster, and free.
Strict-Match Domains: Legal citations, drug names, product SKUs, error codes. Exact match is a feature, not a bug. Don't blur it.
Tight Latency Budgets: External API round-trips add ~80ms. If you can't tolerate that, self-host a model, or reconsider the requirement.

Be willing to say no. Embeddings are a tool, not an upgrade every app needs. The honest answer sometimes is "tsvector is fine."

What You Can Ship Monday

1. Enable pgvector
2. Add vector column
3. Embed on save
4. Swap LIKE → cosine

Don't boil the ocean. Ship one working <=> query replacing one LIKE query. Measure recall. Then iterate.

Thank You Questions?

About Me

~20 years in software. PHP 8.3 & 8.4 Release Manager. Monthly security columnist at PHP Architect. O'Reilly author. Founder of Displace Technologies, building infrastructure for open source and critical projects. I help PHP devs bring AI and semantic search into their applications.

Eric Mann
