Semantic search in Rails using sqlite-vec, Kamal and Docker

I've been reading a lot about semantic search (kNN, BM25, RRF, etc.), both within Elasticsearch and in isolation. With the advent of LLMs, and especially open-weights ones combined with modern hardware, it's a perfect time to approach these techniques pragmatically.

On the blog I'm already using a form of BM25 via SQLite's FTS5 extension, which enables full-text search, similar to pg_search on Postgres. What I wanted to do next was experiment with embeddings and enable a form of semantic search too.

Grepping the search engines, I found Alex Garcia's sqlite-vec:

sqlite-vec, a no-dependency SQLite extension written entirely in C that "runs everywhere" (MacOS, Linux, Windows, WASM in the browser, Raspberry Pis, etc).

The next step was to use a model from Hugging Face, create embeddings for all my articles, then enable kNN search by running a local Docker container on my server.

I have tested: EmbeddingGemma, Multilingual-E5-large and Multilingual-E5-base. As a benchmark, I used my own articles and evaluated the results manually, with a sample size of one (myself). Details:

+--------------------------------------------------------------------------+
| MEMORY RATIO MAP (max ratio = 2.07x; lower is better)                    |
+--------------------------------------------------------------------------+
| embeddinggemma-300m / e5-base      1.16x  [###########.........]         |
| e5-large / e5-base                 2.07x  [####################]         |
| e5-large / embeddinggemma-300m     1.78x  [#################...]         |
|                                                                          |
| e5-base --x1.16--> embeddinggemma-300m --x1.78--> e5-large               |
| e5-base ------------------x2.07------------------> e5-large              |
+--------------------------------------------------------------------------+

Memory-wise, E5-large eats the most RSS memory (~4GB), E5-base is the lightest and EmbeddingGemma sits right in the middle (~2GB). When it comes to actual results, Gemma returns the same ones as E5-large, at least according to my empirical tests on my own data.

I also ran some latency benchmarks on my MBP M1 Pro. CPU run, warm local cache, one process: model load plus one encode:

+--------------------------------------------------------------------------+
| EMBEDDING MODEL BENCHMARKS (lower is better)                             |
+--------------------------------------------------------------------------+
| intfloat/multilingual-e5-base (768d)                                     |
|   mem:  1.95 GB  [############............]                              |
|   time: ~6.47s   [#####################...]                              |
|--------------------------------------------------------------------------|
| google/embeddinggemma-300m (768d)                                        |
|   mem:  2.26 GB  [##############..........]                              |
|   time: ~6.24s   [####################....]                              |
|--------------------------------------------------------------------------|
| intfloat/multilingual-e5-large (1024d)                                   |
|   mem:  4.03 GB  [########################]                              |
|   time: ~7.37s   [########################]                              |
+--------------------------------------------------------------------------+

I settled on Gemma because it was much lighter than E5-large while giving similarly good results on my own data.

To use this in my Rails blog, I decided the best approach would be a separate Docker image for a tiny Gemma service with two API endpoints:

+--------------------------------------------------------------------------+
| EMBEDDING API FLOW                                                       |
+--------------------------------------------------------------------------+
| GET /health                                                              |
|   |                                                                      |
|   v                                                                      |
| 200 OK                                                                   |
|                                                                          |
| POST /embed                                                              |
| { "query": "text to embed" }                                             |
|   |                                                                      |
|   v                                                                      |
| 200 OK                                                                   |
| {                                                                        |
|   "embedding": [0.01, -0.02, ...],                                       |
|   "dimensions": 768,                                                     |
|   "model": "google/embeddinggemma-300m"                                  |
| }                                                                        |
+--------------------------------------------------------------------------+
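On the Rails side, calling these two endpoints only needs the standard library. Here's a minimal client sketch; the host name and port match the diagrams, but the class and method names are my own placeholders, not code from the actual app:

```ruby
require "json"
require "net/http"

# Minimal client sketch for the embedding service.
# Host "embedding-inference" and port 8765 come from the diagrams in this
# post; class/method names here are hypothetical.
class EmbeddingClient
  def initialize(base_url: "http://embedding-inference:8765")
    @base = base_url
  end

  # Request body for POST /embed, matching the {"query": "..."} shape above.
  def embed_payload(text)
    JSON.generate(query: text)
  end

  # Returns the embedding array from a 200 response.
  def embed(text)
    uri = URI.join(@base, "/embed")
    res = Net::HTTP.post(uri, embed_payload(text), "Content-Type" => "application/json")
    JSON.parse(res.body).fetch("embedding")
  end

  # GET /health -> true when the service answers 200 OK.
  def available?
    Net::HTTP.get_response(URI.join(@base, "/health")).is_a?(Net::HTTPSuccess)
  rescue SystemCallError, Net::OpenTimeout
    false
  end
end
```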

This will run within the kamal Docker virtual network, so it will be accessible by the Rails app. My VM has 6GB of RAM and two Ampere vCPUs, i.e. RAM-wise it should fit with some headroom left for the Rails app too.
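Wiring a side service like this into kamal is typically done as an accessory in config/deploy.yml. A rough sketch, with placeholder image and host values (the accessory name matches the service name used in the diagrams; adjust everything to your own setup):

```yaml
# Hedged sketch -- image path and host are placeholders, not real values.
accessories:
  embedding-inference:
    image: your-registry/mp-com-embeddings:latest
    host: 203.0.113.10   # placeholder VM address
    port: 8765
```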

On the app side, the implementation is fairly simple: add the sqlite-vec gem, a /semantic-search route for testing the kNN search, and a section in the admin panel where I can trigger the creation of embeddings for my articles.
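On the SQLite side, sqlite-vec stores vectors in a vec0 virtual table. A sketch of what that setup could look like; the column names are my guesses from the flow diagrams, and the extension-loading snippet in the comments follows the sqlite-vec gem's documented usage:

```ruby
# Hedged sketch: the schema below is an assumption based on the flow diagrams
# (article_id + a 768-dim embedding), not the blog's actual migration.
#
# Loading the extension first, per the sqlite-vec gem's README:
#   require "sqlite3"
#   require "sqlite_vec"
#   db = SQLite3::Database.new("storage/production.sqlite3")
#   db.enable_load_extension(true)
#   SqliteVec.load(db)
#   db.enable_load_extension(false)
ARTICLE_EMBEDDINGS_SQL = <<~SQL.freeze
  CREATE VIRTUAL TABLE IF NOT EXISTS article_embeddings USING vec0(
    article_id INTEGER PRIMARY KEY,
    embedding FLOAT[768]
  )
SQL
# In a Rails migration you would run it with: execute(ARTICLE_EMBEDDINGS_SQL)
```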

This is what the Rails flow looks like as an ASCII diagram:

Rails app                     Embedding service
(SemanticSearch + sqlite-vec) (embedding-inference:8765)
---------------------------   ----------------------------
INDEX / UPSERT FLOW
Admin regen
        |
        v
ArticleEmbedding#upsert_for(article)
        |
        +--> available? --no--> trace(skip/error); return false
        |
        +--> PassageBuilder.build_article(article)
        |
        +---------------------------> POST /embed {"query":"passage: ..."}
                                     |
                                     v
                         200 {"embedding":[...],"dimensions":768,"model":"..."}
        <----------------------------+
        |
        +--> validate dims == 768
        +--> delete old row + insert new row in article_embeddings
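The last two steps of the upsert flow (validate dimensions, then delete + insert) could look roughly like this. sqlite-vec accepts vectors as JSON text, which keeps the sketch simple; `db` is assumed to be a SQLite3::Database with the extension loaded, and the method name is hypothetical:

```ruby
require "json"

# Hedged sketch of the delete-old-row + insert-new-row step.
# `db` is assumed to be a SQLite3::Database handle with sqlite-vec loaded;
# table/column names follow the diagrams in this post.
def upsert_embedding(db, article_id, embedding)
  # Mirrors the "validate dims == 768" step from the flow diagram.
  raise ArgumentError, "expected 768 dims, got #{embedding.size}" unless embedding.size == 768

  db.execute("DELETE FROM article_embeddings WHERE article_id = ?", [article_id])
  db.execute(
    "INSERT INTO article_embeddings (article_id, embedding) VALUES (?, ?)",
    [article_id, JSON.generate(embedding)]  # sqlite-vec parses JSON-encoded vectors
  )
end
```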

And this is what happens when you click search on /semantic-search:

GET /semantic-search?query=...
        |
        v
Searcher.search(query)
        |
        +--> PassageBuilder.build_query(query)
        +---------------------------> POST /embed {"query":"query: ..."}
        <---------------------------- 200 {"embedding":[...],"dimensions":768}
        +--> sqlite-vec KNN MATCH [query_vector], k = limit
        +--> join published articles, order by distance ASC
        v
Render ranked semantic results
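The KNN step above maps to sqlite-vec's `MATCH ... AND k = ?` query syntax. A hedged sketch of that query; the SQL mirrors the diagram (join published articles, order by distance ascending), but table and column names are assumptions:

```ruby
require "json"

# Hedged sketch of the KNN query step; `distance` is the column sqlite-vec
# exposes on vec0 KNN queries, the rest of the schema is assumed.
KNN_SQL = <<~SQL.freeze
  SELECT a.id, a.title, e.distance
  FROM article_embeddings e
  JOIN articles a ON a.id = e.article_id
  WHERE e.embedding MATCH ? AND k = ?
    AND a.published = 1
  ORDER BY e.distance ASC
SQL

# `db` is assumed to be a SQLite3::Database with sqlite-vec loaded.
def semantic_search(db, query_vector, limit: 10)
  db.execute(KNN_SQL, [JSON.generate(query_vector), limit])
end
```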

The full source code for the embedding service can be found in my repo: mp-com-embeddings.

This is still an early experiment, but it already feels like a practical complement to FTS5 rather than a replacement for it. BM25 remains great for exact terms and titles, while embeddings help when the wording drifts but the intent stays the same. Next up is a home-brewed RRF implementation, plus taking performance seriously this time: currently I'm just using Rails' rate limiter at the controller level to avoid killing the server, with no caching.
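Since RRF is next on the list, here's roughly what the fusion step could look like, a minimal sketch rather than the eventual implementation. Each ranked list (say, BM25 results and kNN results) contributes 1/(k + rank) per document, and summed scores give the fused order; k = 60 is the conventional constant:

```ruby
# Minimal Reciprocal Rank Fusion sketch: fuse several ranked lists of ids.
# Each list contributes 1 / (k + rank) per document (rank is 1-based);
# documents appearing high in multiple lists float to the top.
def rrf(ranked_lists, k: 60)
  scores = Hash.new(0.0)
  ranked_lists.each do |list|
    list.each_with_index do |doc_id, index|
      scores[doc_id] += 1.0 / (k + index + 1)
    end
  end
  scores.sort_by { |_, score| -score }.map(&:first)
end

# Example: doc 2 appears in both lists, so it wins the fused ranking.
bm25_hits = [1, 2, 3]
knn_hits  = [2, 3, 4]
rrf([bm25_hits, knn_hits])  # => [2, 3, 1, 4]
```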

Notes

kNN

short for k-nearest neighbors, a way of finding the closest vectors to a given query embedding.

BM25

short for Best Matching 25, a ranking function commonly used in keyword-based full-text search.

RRF

short for Reciprocal Rank Fusion, a simple way to combine multiple ranked result sets, often used in hybrid search setups.
