Semantic search in Rails using sqlite-vec, Kamal and Docker

I’ve been reading a lot about search lately: kNN, BM25, RRF and related ideas, both within Elasticsearch and in isolation. With the advent of LLMs, especially open-weight models, plus modern hardware, it feels like a good time to approach some of this in a pragmatic way.

On the blog I'm already using a form of BM25 via SQLite's FTS5 extension, which enables full-text search, similar to how the pg_search gem uses Postgres's built-in full-text search.

What I wanted to do next was to experiment with embeddings and enable a form of semantic search too.

If you want to try it out, it’s live at /semantic-search.

While searching around, I came across Alex Garcia's sqlite-vec. He describes it as “a no-dependency SQLite extension written entirely in C that runs everywhere”. That was appealing in itself since I did not want to introduce a separate client/server vector database for a small Rails app. The tradeoff is that it is brute-force search only, with “no ANN indexes (yet!)”, but at my scale that felt perfectly reasonable.
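
Brute force here just means scoring every stored vector against the query and keeping the k closest. A minimal pure-Python sketch of the idea (toy 3-dimensional vectors standing in for real embeddings; not sqlite-vec's actual code):

```python
import math

def l2_distance(a: list[float], b: list[float]) -> float:
    # Euclidean (L2) distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query: list[float], rows: dict[int, list[float]], k: int) -> list[tuple[int, float]]:
    # Score every stored vector against the query -- O(n) per search,
    # fine at blog scale, but this is exactly why ANN indexes exist
    scored = [(rowid, l2_distance(query, vec)) for rowid, vec in rows.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy vectors; real embeddings would be 768-dimensional
rows = {1: [0.0, 0.0, 1.0], 2: [0.9, 0.1, 0.0], 3: [1.0, 0.0, 0.0]}
print(knn([1.0, 0.0, 0.0], rows, k=2))  # rowids 3 and 2 come out closest
```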

The next step was to pick a model from Hugging Face, create embeddings for all my articles, and then enable kNN search by running the model in a local Docker container on my server.

Benchmarks

I tested EmbeddingGemma, Multilingual-E5-large and Multilingual-E5-base. As a benchmark, I used my own articles and evaluated the results manually, with a sample size of one: myself.

+--------------------------------------------------------------------------+
| MEMORY RATIO MAP (max ratio = 2.07x; lower is better)                    |
+--------------------------------------------------------------------------+
| embeddinggemma-300m / e5-base      1.16x  [###########.........]         |
| e5-large / e5-base                 2.07x  [####################]         |
| e5-large / embeddinggemma-300m     1.78x  [#################...]         |
|                                                                          |
| e5-base --x1.16--> embeddinggemma-300m --x1.78--> e5-large               |
| e5-base ------------------x2.07------------------> e5-large              |
+--------------------------------------------------------------------------+

Memory-wise, Multilingual-E5-large uses the most RSS (~4GB), E5-base is the lightest and EmbeddingGemma sits between them (~2GB). As for actual results, Gemma returned results similar to E5-large's, at least empirically on my own data.

I also ran some latency benchmarks on my MBP M1 Pro: CPU only, cold cache (the model was loaded into memory on each run), single process, measuring model load plus one encode:

+--------------------------------------------------------------------------+
| EMBEDDING MODEL BENCHMARKS (lower is better)                             |
+--------------------------------------------------------------------------+
| intfloat/multilingual-e5-base (768d)                                      |
|   mem:  1.95 GB  [############............]                              |
|   time: ~6.47s   [#####################...]                              |
|--------------------------------------------------------------------------|
| google/embeddinggemma-300m (768d)                                        |
|   mem:  2.26 GB  [##############..........]                              |
|   time: ~6.24s   [####################....]                              |
|--------------------------------------------------------------------------|
| intfloat/multilingual-e5-large (1024d)                                    |
|   mem:  4.03 GB  [########################]                              |
|   time: ~7.37s   [########################]                              |
+--------------------------------------------------------------------------+

I settled on Gemma because it was much lighter than Multilingual-E5-large while giving similarly good results on my own data.

Rails implementation

In order to use this in my Rails blog, I decided the best approach would be a separate Docker image for a tiny Gemma service with two API endpoints:

+--------------------------------------------------------------------------+
| EMBEDDING API FLOW                                                       |
+--------------------------------------------------------------------------+
| GET /health                                                              |
|   |                                                                      |
|   v                                                                      |
| 200 OK                                                                   |
|                                                                          |
| POST /embed                                                              |
| { "query": "text to embed" }                                             |
|   |                                                                      |
|   v                                                                      |
| 200 OK                                                                   |
| {                                                                        |
|   "embedding": [0.01, -0.02, ...],                                       |
|   "dimensions": 768,                                                     |
|   "model": "google/embeddinggemma-300m"                                  |
| }                                                                        |
+--------------------------------------------------------------------------+
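
On the wire this is a plain JSON POST. A minimal Python sketch of a client for that contract (the timeout and the dimensions guard are illustrative assumptions, not the blog's actual code):

```python
import json
from urllib import request

EMBED_URL = "http://embedding-inference:8765/embed"  # Kamal-internal hostname

def build_embed_request(text: str) -> request.Request:
    # The contract from the diagram: POST {"query": "..."} as JSON
    payload = json.dumps({"query": text}).encode("utf-8")
    return request.Request(
        EMBED_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(text: str) -> list[float]:
    with request.urlopen(build_embed_request(text), timeout=10) as resp:
        body = json.load(resp)
    if body["dimensions"] != 768:  # guard against a model/schema mismatch
        raise ValueError(f"unexpected dimensions: {body['dimensions']}")
    return body["embedding"]
```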

This runs within Kamal's Docker virtual network, so it's accessible from the Rails app. My VM has 6GB of RAM and two Ampere vCPUs, so RAM-wise it should fit with some headroom for the Rails app too.

The current performance on the VM looks like this:

When                 Article ID  Operation  Source    Status   Duration  Dims  Bytes  Model                       Error
2026-03-29 23:30:03  46          upsert     callback  success  3368ms    768   8819   google/embeddinggemma-300m  -
2026-03-29 23:28:12  46          upsert     callback  success  3428ms    768   8819   google/embeddinggemma-300m  -
...
+--------------------------------------------------------------------------+
| EMBEDDING LATENCY HISTOGRAM (ms)                                         |
+--------------------------------------------------------------------------+
|   0-999    | [##......................]  2                               |
| 1000-1999  | [##################......] 18                               |
| 2000-2999  | [##########..............] 10                               |
| 3000-3999  | [###############.........] 15                               |
| 4000-4999  | [####....................]  4                               |
| 5000-6999  | [#.......................]  1                               |
+--------------------------------------------------------------------------+

The median response time for the API is ~2726 ms, with ~2.5GB of RAM in use by the embedding service and the two Puma workers combined. Given this is a really low-spec VM and there's no cache at all in the hot path, I'm quite happy with the experiment so far.

On the app side, the implementation is fairly simple: add the sqlite-vec gem, a /semantic-search route for testing the kNN search, and a section in the admin panel where I can trigger the creation of embeddings for my articles.

This is what the Rails flow looks like as an ASCII diagram:

Rails app                     Embedding service
(SemanticSearch + sqlite-vec) (embedding-inference:8765)
---------------------------   ----------------------------
INDEX / UPSERT FLOW
Admin regen
        |
        v
ArticleEmbedding#upsert_for(article)
        |
        +--> available? --no--> trace(skip/error); return false
        |
        +--> PassageBuilder.build_article(article)
        |
        +---------------------------> POST /embed {"query":"passage: ..."}
                                     |
                                     v
                         200 {"embedding":[...],"dimensions":768,"model":"..."}
        <----------------------------+
        |
        +--> validate dims == 768
        +--> delete old row + insert new row in article_embeddings
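
For the insert step, sqlite-vec expects the vector either as a JSON array string or as a compact blob of little-endian float32 values. A sketch of the serialization plus a hypothetical delete-then-insert (the table and column names are my guesses based on the diagram, not the app's actual schema):

```python
import struct

def serialize_f32(vec: list[float]) -> bytes:
    # sqlite-vec accepts vectors as compact little-endian float32 blobs
    return struct.pack(f"<{len(vec)}f", *vec)

# Hypothetical SQL mirroring the delete-old-row + insert-new-row step
DELETE_SQL = "DELETE FROM article_embeddings WHERE article_id = ?"
INSERT_SQL = "INSERT INTO article_embeddings (article_id, embedding) VALUES (?, ?)"

blob = serialize_f32([0.01, -0.02, 0.03])
assert len(blob) == 3 * 4  # four bytes per float32
# db.execute(DELETE_SQL, (article_id,))
# db.execute(INSERT_SQL, (article_id, blob))
```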

And this is what happens when you click search on /semantic-search:

GET /semantic-search?query=...
        |
        v
Searcher.search(query)
        |
        +--> PassageBuilder.build_query(query)
        +---------------------------> POST /embed {"query":"query: ..."}
        <---------------------------- 200 {"embedding":[...],"dimensions":768}
        +--> sqlite-vec KNN MATCH [query_vector], k = limit
        +--> join published articles, order by distance ASC
        v
Render ranked semantic results
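
The KNN MATCH step corresponds to a small SQL query against the virtual table. A hedged sketch of what that might look like (table and column names are assumptions; the real query also joins the articles table and filters to published rows):

```python
import struct

# Hypothetical table/column names for illustration only
KNN_SQL = """
SELECT article_id, distance
FROM article_embeddings
WHERE embedding MATCH ? AND k = ?
ORDER BY distance
"""

def knn_params(query_vec: list[float], k: int) -> tuple[bytes, int]:
    # Bind the serialized query vector and the neighbour count;
    # sqlite-vec exposes the distance as a virtual `distance` column
    return struct.pack(f"<{len(query_vec)}f", *query_vec), k

params = knn_params([0.1, 0.2, 0.3], k=10)
# rows = db.execute(KNN_SQL, params).fetchall()  # nearest first
```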

The embedding service

The full source code for the embedding service can be found in my repo: mp-com-embeddings.

It runs on Python/Torch and uses the sentence_transformers library. Here's a small excerpt from the POC:

from typing import Any

from sentence_transformers import SentenceTransformer


def load_model(model_name: str, token: str | None, verbose: bool) -> SentenceTransformer:
    kwargs: dict[str, Any] = {}
    if token:
        kwargs["token"] = token

    # local_files_only avoids hitting the Hugging Face Hub at runtime
    return SentenceTransformer(model_name, local_files_only=True, **kwargs)

This is still an early experiment, but it already feels like a practical complement to FTS5 rather than a replacement for it.

BM25 remains great for exact terms and titles, while embeddings help when the wording drifts but the intent stays the same. Next up is a home-brewed RRF implementation, plus taking performance seriously this time.
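
For what it's worth, the core of RRF fits in a few lines: each ranked list contributes 1/(k + rank) per document, with k = 60 being the conventional constant. A sketch of the fusion step (hypothetical, not the implementation I'll end up with):

```python
def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Each ranked list contributes 1 / (k + rank) for every doc it contains;
    # documents that rank well in several lists float to the top
    scores: dict[int, float] = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = [1, 2, 3]      # e.g. FTS5 ranking
semantic_results = [3, 1, 4]  # e.g. sqlite-vec ranking
print(rrf([bm25_results, semantic_results]))  # docs 1 and 3 lead
```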

Notes

  • kNN - k-nearest neighbors, a way of finding the closest vectors to a given query embedding.
  • BM25 - Best Matching 25, a ranking function commonly used in keyword-based full-text search.
  • RRF - Reciprocal Rank Fusion, a simple way to combine multiple ranked result sets, often used in hybrid search setups.
