AI Engineering · September 15, 2024 · 8 min read

Building an AI-Powered Product Recommendation System

Recommendation systems are one of the highest-leverage places to add AI to an existing product. Done well, they drive measurable uplift in engagement and revenue. Done poorly, they surface irrelevant content, erode trust, and create a maintenance burden that outweighs the benefit.

This post walks through the architecture we use for production recommendation systems — the kind that improves over time rather than degrading silently.

The baseline problem with most recommendation systems

Most teams implement recommendations by starting with collaborative filtering (users who liked X also liked Y) or simple popularity rankings. These work at first. They stop working when:

  • The catalog is large and sparse (most items have few interactions)
  • The user base is heterogeneous (new users vs. long-term users need different signals)
  • The catalog changes frequently (new items have no interaction history)
  • There's no feedback loop — the system can't learn from what it got wrong

Embeddings solve most of these problems. The key insight: represent both users and items in the same vector space, so that recommendation becomes a nearest-neighbour lookup.
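As a concrete sketch of that lookup — plain NumPy and brute-force cosine similarity, standing in for a real ANN index:

```python
import numpy as np

def top_k_similar(user_vec: np.ndarray, item_vecs: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k items nearest to the user vector by cosine similarity."""
    user_norm = user_vec / np.linalg.norm(user_vec)
    item_norms = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = item_norms @ user_norm          # cosine similarity per item
    return np.argsort(scores)[::-1][:k].tolist()

# Toy example: four items in a 3-d space; the user vector sits near items 0 and 2
items = np.array([[1.0, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.9, 0.2, 0.1],
                  [0.0, 0.1, 1.0]])
user = np.array([1.0, 0.2, 0.0])
print(top_k_similar(user, items, k=2))  # → [0, 2]
```

At production scale you'd swap the brute-force scan for the vector store's ANN query, but the shape of the operation is the same.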

Architecture overview

The system has four components:

  • Embedding service — converts items and user behaviour into dense vectors
  • Vector store — stores and indexes embeddings for fast similarity search (we use Pinecone in production, pgvector works fine at smaller scale)
  • Retrieval layer — given a user context, fetches candidate items via ANN search
  • Re-ranking layer — applies business rules and freshness signals to the candidate set before returning final recommendations

Building the embedding service

For product recommendations, we embed item content (title, description, category, attributes) using a pre-trained sentence transformer. OpenAI's text-embedding-3-small is cost-effective and performant for most use cases. For domain-specific catalogs (medical, legal, technical), fine-tuning or a domain-specific model often outperforms general embeddings.
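A minimal sketch of the item side, assuming the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the field names (title, description, category, attributes) are illustrative, not a fixed schema:

```python
def item_to_text(item: dict) -> str:
    """Flatten the catalog fields we embed into a single string."""
    attrs = ", ".join(f"{k}: {v}" for k, v in item.get("attributes", {}).items())
    parts = [item["title"], item.get("description", ""), item.get("category", ""), attrs]
    return " | ".join(filter(None, parts))

def embed_items(items: list[dict], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of catalog items in one API call."""
    from openai import OpenAI  # deferred so the pure helper above works without the SDK
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    resp = client.embeddings.create(model=model, input=[item_to_text(i) for i in items])
    return [d.embedding for d in resp.data]
```

Batching items into a single `embeddings.create` call matters at catalog scale — per-item calls are slow and rate-limited.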

User embeddings are computed from interaction history: the weighted average of the item embeddings the user has interacted with, weighted by recency and engagement signal (click < add-to-cart < purchase < review).
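That weighted average might look like the following; the event weights and the 30-day half-life are hypothetical placeholders, not tuned values:

```python
import numpy as np

# Illustrative engagement weights reflecting click < add-to-cart < purchase < review
EVENT_WEIGHTS = {"click": 1.0, "add_to_cart": 3.0, "purchase": 8.0, "review": 10.0}

def user_embedding(events: list[dict], item_vecs: dict[str, np.ndarray],
                   half_life_days: float = 30.0) -> np.ndarray:
    """Weighted average of interacted-item vectors, decayed by recency."""
    total, weight_sum = None, 0.0
    for e in events:
        decay = 0.5 ** (e["age_days"] / half_life_days)   # halve weight per half-life
        w = EVENT_WEIGHTS[e["type"]] * decay
        vec = item_vecs[e["item_id"]] * w
        total = vec if total is None else total + vec
        weight_sum += w
    return total / weight_sum
```

Because the user vector lives in the same space as the item vectors, it plugs straight into the nearest-neighbour retrieval step.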

The feedback loop

This is the part most teams skip. A recommendation system without a feedback loop will degrade over time as the catalog evolves and user preferences shift. You need:

  • Impression logging — what did you show, to whom, when
  • Click logging — what was clicked from a recommendation surface
  • Conversion logging — what downstream action followed
  • A scheduled job that retrains or updates embeddings using recent interaction data
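The three log streams can share one schema, keyed by a request ID so clicks and conversions join back to the impression that produced them. A sketch — field names are illustrative, and the `print` is a stand-in for whatever event pipeline you already run:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class RecEvent:
    """One row in the recommendation event log (hypothetical schema)."""
    event_type: str   # "impression" | "click" | "conversion"
    request_id: str   # joins clicks/conversions back to their impression
    user_id: str
    item_id: str
    surface: str      # e.g. "homepage", "product_page"
    ts: float

def log_impressions(user_id: str, item_ids: list[str], surface: str) -> str:
    """Log one impression row per shown item; return the shared request ID."""
    request_id = str(uuid.uuid4())
    for item_id in item_ids:
        event = RecEvent("impression", request_id, user_id, item_id, surface, time.time())
        print(json.dumps(asdict(event)))  # stand-in for your event pipeline
    return request_id
```

Propagate the returned request ID to the client so click and conversion events carry it back — without that join key, the retraining job can't tell which impressions led to which outcomes.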

The retraining frequency depends on catalog velocity. For a slow-moving catalog, weekly retraining is fine. For a fast-moving catalog (daily new items), you need online updates — embedding new items immediately and indexing them into the vector store without a full rebuild.
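An online update is just embed-and-upsert on item creation. A sketch against a Pinecone-style `upsert` interface; the `embed` callable and the metadata fields are assumptions:

```python
from typing import Callable

def index_new_item(item: dict, embed: Callable[[str], list[float]], index) -> None:
    """Embed a newly created catalog item and upsert it into the vector store
    immediately, rather than waiting for the next batch rebuild.

    `index` is assumed to expose upsert(vectors=[(id, vector, metadata), ...]).
    """
    text = " ".join(filter(None, [item["title"], item.get("description", "")]))
    vec = embed(text)
    index.upsert(vectors=[(item["id"], vec, {"category": item.get("category", "")})])
```

Hooking this into the item-creation path makes new items recommendable within seconds instead of waiting for the next scheduled retrain.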

Re-ranking for business logic

The most semantically similar items aren't always the ones you want to surface. A recommendation system that only optimises for similarity will over-index on safe choices. The re-ranking layer is where business rules live:

  • Freshness decay — boost newer items, reduce weight of items older than X days
  • Inventory filtering — don't recommend out-of-stock items
  • Diversity enforcement — avoid showing 10 variations of the same item
  • Business promotions — allow manual boosts for specific items

Keeping business logic in the re-ranking layer (rather than baked into the model) means it can be updated without retraining.

What to instrument from day one

Before you go live, instrument these metrics:

  • Click-through rate per recommendation surface
  • Conversion rate from recommendation click
  • Coverage — what percentage of your catalog ever appears in recommendations
  • Diversity — average pairwise distance between items in a recommendation set

Coverage and diversity are the early warning signals for the most common failure modes: a system that concentrates recommendations on a small subset of popular items, and a system that returns repetitive results.
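Both metrics are cheap to compute offline from the impression logs. A sketch, assuming you can look up the embedding vectors for a recommendation set:

```python
import itertools
import numpy as np

def coverage(recommendation_sets: list[list[str]], catalog_ids: set[str]) -> float:
    """Fraction of the catalog that appeared in at least one recommendation set."""
    shown = set(itertools.chain.from_iterable(recommendation_sets))
    return len(shown & catalog_ids) / len(catalog_ids)

def diversity(item_vecs: list[np.ndarray]) -> float:
    """Average pairwise cosine distance within one recommendation set."""
    dists = []
    for a, b in itertools.combinations(item_vecs, 2):
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        dists.append(1.0 - cos)
    return sum(dists) / len(dists)
```

Coverage trending down means recommendations are concentrating on popular items; diversity trending toward zero means users are seeing near-duplicates.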

The short version

Embeddings + vector search is the right foundation. Build the feedback loop before you launch, not after. Keep business logic in re-ranking, not in the model. Instrument coverage and diversity from day one.

Still reading? Good. Book a 30-minute call.

No sales pitch. We'll ask what's on fire and tell you if we can help. If we can't, we'll name three firms who can.

Book a call →