Context Engine Technical Deep Dive: Rust, LanceDB, BERT, and Hybrid Search

This post is a technical companion to the LanceDB Context Engine overview. Here I'll walk through the actual Rust crate dependencies, the embedding model choices and trade-offs, the Arrow-based schema design, and the hybrid search implementation — with real code from the codebase.

The Rust Dependency Stack

The context engine lives in the claude-tools crate under dev-tools/. The core dependencies tell the story of the architecture.

Cargo.toml — key dependencies
lancedb = "0.23.1"        # Embedded vector + columnar DB
arrow-schema = "56.2.0"    # Arrow schema definitions
arrow-array = "56.2.0"     # Arrow array types (StringArray, Float32Array, etc.)
rust-bert = "0.23"         # Transformer models via libtorch
tch = "0.17"               # Rust bindings to PyTorch C++ (libtorch)
axum = "0.8"               # HTTP server for persistent model serving
tokio = { version = "1", features = ["rt-multi-thread"] }
futures = "0.3"            # Stream processing for Lance query results

LanceDB is the centerpiece — an embedded columnar database built on Apache Arrow that supports both full-text search (BM25) and vector similarity search natively, with built-in reranking. No external services, no Docker containers, no network calls. The database persists as a .lance directory on disk.

rust-bert + tch provide the embedding model. rust-bert wraps HuggingFace transformer architectures in Rust, and tch is the raw libtorch binding. The model weights are downloaded on first run and cached locally.

Embedding Model: AllMiniLmL6V2

We use sentence-transformers/all-MiniLM-L6-v2 — a 22M parameter distilled model that produces 384-dimensional embeddings. The model choice was deliberate.

Why AllMiniLmL6V2 over alternatives: It loads in under 2 seconds on CPU. The 384-dimensional output keeps the Lance table compact — at 165 docs with ~1,246 sections, the entire vector index fits in memory. Larger models like all-mpnet-base-v2 (768d) or e5-large (1024d) would give marginally better semantic matching but double or triple the index size and load time for minimal practical gain at our document count.
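A quick back-of-envelope check grounds the "fits in memory" claim. This is a standalone sketch using the post's own figures (~1,246 sections, two 384-d Float32 embeddings per row); embedding_bytes is a hypothetical helper, not from the codebase:

```rust
// Raw vector storage: rows x vectors-per-row x dims x sizeof(f32).
// Hypothetical helper for illustration, not part of the codebase.
fn embedding_bytes(rows: usize, vectors_per_row: usize, dims: usize) -> usize {
    rows * vectors_per_row * dims * std::mem::size_of::<f32>()
}

fn main() {
    let bytes = embedding_bytes(1_246, 2, 384);
    // ~3.7 MiB of raw vector data -- comfortably memory-resident at this scale
    println!("{:.1} MiB", bytes as f64 / (1024.0 * 1024.0));
}
```

At roughly 3.7 MiB of raw vectors, even a brute-force scan is cheap; a 768d or 1024d model would double or nearly triple that for the same row count.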

The model is loaded once via rust-bert's pipeline API.

context_build.rs — model loading
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsBuilder,
    SentenceEmbeddingsModelType,
};

let model = SentenceEmbeddingsBuilder::remote(
    SentenceEmbeddingsModelType::AllMiniLmL6V2
)
    .create_model()
    .expect("Failed to load embedding model");

Each row in the Lance table gets two embeddings: one for its triggers, one for its content. Rather than calling model.encode() per row (2,492 calls for 1,246 sections), we collect all trigger texts into one vector and all content texts into another, then encode each in a single batch call — two calls total. Batched inference amortizes libtorch's per-call dispatch overhead across all inputs and lets the backend parallelize within the batch.

context_build.rs — batch embedding
fn compute_embeddings(model: &SentenceEmbeddingsModel, rows: &mut [SectionRow]) {
    let triggers_texts: Vec<String> = rows.iter()
        .map(|r| r.triggers.replace('\n', " "))
        .collect();

    let content_texts: Vec<String> = rows.iter()
        .map(|r| r.content.chars().take(2000).collect())
        .collect();

    let triggers_embeddings = model.encode(&triggers_texts)
        .expect("Failed to encode triggers");
    let content_embeddings = model.encode(&content_texts)
        .expect("Failed to encode content");

    for (i, row) in rows.iter_mut().enumerate() {
        row.triggers_embedding = triggers_embeddings[i].as_slice().to_vec();
        row.content_embedding = content_embeddings[i].as_slice().to_vec();
    }
}

Content is truncated to 2,000 characters before embedding — roughly 512 tokens, which is the model's effective window. Longer text gets diminishing returns from mean pooling.
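The chars().take(2000) in the snippet above is also a correctness choice, not just a length limit: Rust byte slicing like &s[..2000] panics if the cut lands inside a multi-byte UTF-8 character, while a char-wise take is always safe. A standalone sketch (truncate_chars is a hypothetical helper illustrating the same pattern):

```rust
// Char-wise truncation is UTF-8 safe; byte slicing is not.
// Hypothetical helper mirroring the `chars().take(n)` pattern above.
fn truncate_chars(s: &str, max: usize) -> String {
    s.chars().take(max).collect()
}

fn main() {
    let s = "naïve query";
    assert_eq!(truncate_chars(s, 3), "naï");
    // By contrast, &s[..3] would panic here: byte 3 falls inside
    // the two-byte UTF-8 encoding of 'ï'.
}
```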

Arrow Schema and LanceDB Table

LanceDB is built on Apache Arrow, so the table schema is defined with Arrow types. Each row is one section of one context document.

Apache Arrow is a columnar memory format designed for analytics and data processing. Instead of storing data row-by-row (like JSON or SQLite), Arrow stores each column as a contiguous array — all doc_names together, all triggers together, all embeddings together. This layout means scanning a single column (e.g. filtering by cluster) reads sequential memory with zero deserialization. LanceDB inherits this: queries that only touch 3 of 15 columns never load the other 12 from disk. The Rust crates arrow-schema and arrow-array give us type-safe schema definitions and zero-copy array construction.
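The columnar idea in miniature, as a std-only toy sketch (hypothetical types, not the engine's actual schema): with a struct-of-arrays layout, filtering on cluster walks one contiguous Vec and never reads the potentially large content column.

```rust
// Struct-of-arrays layout: each column is its own contiguous Vec, so a
// filter on `cluster` scans one array sequentially and never touches
// `content`. Hypothetical toy types, not the real Lance schema.
#[allow(dead_code)]
struct SectionColumns {
    doc_name: Vec<String>,
    cluster: Vec<String>,
    content: Vec<String>, // never read by the filter below
}

fn rows_in_cluster(cols: &SectionColumns, cluster: &str) -> Vec<usize> {
    cols.cluster
        .iter()
        .enumerate()
        .filter(|(_, c)| c.as_str() == cluster)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let cols = SectionColumns {
        doc_name: vec!["a".into(), "b".into(), "c".into()],
        cluster: vec!["rust".into(), "db".into(), "rust".into()],
        content: vec!["...".into(), "...".into(), "...".into()],
    };
    assert_eq!(rows_in_cluster(&cols, "rust"), vec![0, 2]);
}
```

Arrow formalizes exactly this layout (plus validity bitmaps, buffers, and a type system), which is what lets LanceDB skip unselected columns entirely.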

db.rs — schema definition
pub const EMBEDDING_DIM: i32 = 384;

pub fn context_schema() -> Schema {
    Schema::new(vec![
        Field::new("id", DataType::Utf8, false),
        Field::new("doc_name", DataType::Utf8, false),
        Field::new("section_title", DataType::Utf8, false),
        Field::new("section_idx", DataType::UInt32, false),
        Field::new("cluster", DataType::Utf8, false),
        Field::new("triggers", DataType::Utf8, false),
        Field::new("content", DataType::Utf8, false),
        Field::new("related_docs", DataType::Utf8, false),
        Field::new("related_sections", DataType::Utf8, false),
        Field::new(
            "triggers_embedding",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float32, true)),
                EMBEDDING_DIM,
            ),
            false,
        ),
        Field::new(
            "content_embedding",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float32, true)),
                EMBEDDING_DIM,
            ),
            false,
        ),
    ])
}

Embeddings are stored as FixedSizeList of 384 Float32 values — Arrow's native representation for fixed-length vectors. This lets LanceDB build vector indexes directly on the column without serialization overhead.

Three indexes are created at build time: BM25 full-text search on triggers, BM25 on content, and an auto-configured vector index on triggers_embedding.

db.rs — index creation
// BM25 full-text search on trigger phrases
table.create_index(
    &["triggers"],
    Index::FTS(FtsIndexBuilder::default().with_position(true)),
).execute().await?;

// BM25 on prose content
table.create_index(
    &["content"],
    Index::FTS(FtsIndexBuilder::default().with_position(true)),
).execute().await?;

// Vector ANN index on trigger embeddings
table.create_index(
    &["triggers_embedding"],
    Index::Auto,
).execute().await?;
  • with_position(true) — Enables positional indexing for FTS. This supports phrase queries and proximity boosting in BM25 scoring — word order matters, not just word presence.
  • Index::FTS — Creates a BM25 full-text search index. We build two: one on trigger phrases (primary search target) and one on prose content (fallback).
  • Index::Auto — Lets LanceDB choose the optimal vector index type based on dataset size — IVF_PQ for large sets, flat scan for small. At ~1,200 rows, it picks flat, which is exact nearest neighbor with no approximation loss.
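What "flat" means in practice is just an exact scan over every row's vector. At ~1,200 rows × 384 dims that is under half a million multiply-adds per query, so approximation (IVF_PQ) buys nothing. A standalone sketch of the idea, assuming embeddings are L2-normalized so the dot product equals cosine similarity:

```rust
// Flat (exact) nearest-neighbor search: scan every vector, keep the best.
// Standalone sketch; assumes L2-normalized vectors so dot == cosine.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn flat_search(rows: &[Vec<f32>], query: &[f32]) -> Option<usize> {
    rows.iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| {
            dot(a, query).partial_cmp(&dot(b, query)).unwrap()
        })
        .map(|(i, _)| i)
}

fn main() {
    let rows = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    assert_eq!(flat_search(&rows, &[0.9, 0.1]), Some(0));
}
```

IVF_PQ trades exactness for speed by clustering vectors and quantizing them; at this row count the flat scan is already fast enough that the trade is pure loss.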

Hybrid Search: BM25 + Vector + RRF

The core search function runs both BM25 and vector search simultaneously, then merges with Reciprocal Rank Fusion.

search.rs — hybrid browse
use lancedb::rerankers::rrf::RRFReranker;

pub async fn browse_with_embedding(
    table: &lancedb::Table,
    query: &str,
    embedding: Option<&[f32]>,
) -> Vec<BrowseResult> {
    let fts_query = FullTextSearchQuery::new(query.to_string())
        .with_column("triggers".to_string())
        .expect("valid column");

    let stream_result = if let Some(emb) = embedding {
        // Hybrid: BM25 + vector + RRF reranking
        let reranker = Arc::new(RRFReranker::default()); // k=60
        table.query()
            .nearest_to(emb).expect("valid vector")
            .column("triggers_embedding")
            .full_text_search(fts_query)
            .select(Select::columns(&[
                "doc_name", "section_title",
                "triggers", "cluster",
            ]))
            .rerank(reranker)
            .limit(15)
            .execute_hybrid(Default::default())
            .await
    } else {
        // FTS-only: pure BM25, no model needed
        table.query()
            .full_text_search(fts_query)
            .select(Select::columns(&[...]))
            .limit(15)
            .execute()
            .await
    };

    // stream_result is then collected into Vec<BrowseResult>
    // (collection code omitted here for brevity)
}

RRF with k=60 means each result's score is 1/(k + rank). A section ranked #1 in BM25 and #3 in vector search gets score 1/61 + 1/63 = 0.0323. A section ranked #1 in only one list gets 1/61 = 0.0164. Appearing in both lists roughly doubles your score — this is how hybrid search surfaces results that match both semantically and lexically.
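The fusion formula itself fits in a few lines. This is a standalone sketch of the arithmetic LanceDB's RRFReranker applies, not the library's own code (ranks start at 1, k = 60):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: each result scores 1/(k + rank) per ranked list
// it appears in, summed across lists. Sketch of the formula, not the
// RRFReranker implementation.
fn rrf_scores<'a>(lists: &[Vec<&'a str>], k: f64) -> HashMap<&'a str, f64> {
    let mut scores = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    scores
}

fn main() {
    let bm25 = vec!["intro", "setup", "api"];
    let vector = vec!["api", "faq", "intro"];
    let scores = rrf_scores(&[bm25, vector], 60.0);
    // "intro": #1 in BM25, #3 in vector -> 1/61 + 1/63 ≈ 0.0323
    assert!((scores["intro"] - (1.0 / 61.0 + 1.0 / 63.0)).abs() < 1e-9);
    // "faq": #2 in one list only -> 1/62 ≈ 0.0161
    assert!((scores["faq"] - 1.0 / 62.0).abs() < 1e-9);
}
```

Note that RRF only consumes ranks, never raw scores — so BM25's unbounded scores and cosine similarities in [-1, 1] fuse cleanly without normalization.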

Graph Expansion on Fetch

After direct hits are collected, the fetch function follows related_docs and related_sections edges to pull in neighboring context — one hop of graph expansion, no separate graph database.

search.rs — graph expansion
// Collect related doc names from direct hits
let expand_docs: Vec<String> = related_doc_names
    .difference(&hit_doc_names)
    .cloned()
    .collect();

// Fetch Overview (section_idx=0) of each related doc
let in_list = expand_docs.iter()
    .map(|d| format!("'{}'", d.replace('\'', "''")))
    .collect::<Vec<_>>()
    .join(", ");
let filter = format!(
    "doc_name IN ({}) AND section_idx = 0",
    in_list
);

table.query()
    .only_if(filter)
    .select(Select::columns(&["doc_name", "section_title", "content"]))
    .execute()
    .await

This is just a SQL filter on the same Lance table — no separate graph store. Each context doc has a related field listing connected docs, and each section can list related sections. The expansion only fetches Overview sections (index 0) to keep the context payload bounded.
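Since the IN-list above is assembled by string formatting rather than bound parameters, embedded single quotes must be doubled ('' is SQL's escape for ' inside a string literal) — that is what the replace('\'', "''") call does. A standalone sketch of the same escaping; sql_in_list is a hypothetical helper name:

```rust
// Build a SQL IN-list from doc names, doubling single quotes so names
// like "o'reilly-notes" can't break out of the string literal.
// Hypothetical helper mirroring the escaping in the fetch code above.
fn sql_in_list(names: &[&str]) -> String {
    names
        .iter()
        .map(|n| format!("'{}'", n.replace('\'', "''")))
        .collect::<Vec<_>>()
        .join(", ")
}

fn main() {
    let filter = format!(
        "doc_name IN ({}) AND section_idx = 0",
        sql_in_list(&["rust-guide", "o'reilly-notes"])
    );
    assert_eq!(
        filter,
        "doc_name IN ('rust-guide', 'o''reilly-notes') AND section_idx = 0"
    );
}
```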

Server Architecture: Model Thread Isolation

The persistent server (context_server) solves a practical problem: rust-bert models use libtorch tensors that aren't Send + Sync. The solution is a dedicated model thread communicating via channels.

context_server.rs — model thread pattern
enum EmbedRequest {
    Encode {
        text: String,
        reply: oneshot::Sender<Vec<f32>>,
    },
}

struct ModelHandle {
    tx: mpsc::Sender<EmbedRequest>,
}

impl ModelHandle {
    async fn encode(&self, text: &str) -> Vec<f32> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(EmbedRequest::Encode {
            text: text.to_string(),
            reply: reply_tx,
        }).await.expect("Model thread died");
        reply_rx.await.expect("Model thread dropped reply")
    }
}

// Spawned as std::thread (not tokio task)
std::thread::spawn(move || {
    let model = SentenceEmbeddingsBuilder::remote(
        SentenceEmbeddingsModelType::AllMiniLmL6V2
    ).create_model().expect("Failed to load");

    while let Some(req) = rx.blocking_recv() {
        match req {
            EmbedRequest::Encode { text, reply } => {
                let embeddings = model.encode(&[&text])
                    .expect("Failed to encode");
                let _ = reply.send(
                    embeddings[0].as_slice().to_vec()
                );
            }
        }
    }
});

The model lives on a plain std::thread, not a tokio task — libtorch's blocking operations would starve the async runtime. Axum handlers send encode requests through an mpsc channel and await the oneshot reply. The model loads once on startup, then every query is just an encode call — typically under 5ms for a single sentence.

Alternative Approaches Considered

Before settling on this stack, we evaluated several alternatives. SQLite + sqlite-vss was considered for the vector search, but it requires separate FTS5 setup and has no built-in hybrid reranking — we'd have to implement RRF manually. Qdrant and Milvus are purpose-built vector databases, but they run as separate services with network overhead, which is unnecessary for a 1,246-row dataset.

For embedding models, OpenAI's text-embedding-3-small (1536d) would give better semantic quality but requires API calls — adding latency, cost, and a network dependency to every query. For a developer tool that runs hundreds of queries per session, local inference is non-negotiable. The ONNX runtime via ort crate was considered as an alternative to libtorch — it has smaller binaries and better cross-platform support, but rust-bert's pipeline API made the initial implementation faster.

The current stack hits the right trade-offs for our scale: ~1,200 sections, ~6,000 triggers, sub-millisecond BM25 queries, under 10ms hybrid queries with the server running. LanceDB gives us columnar storage, FTS, vector search, and RRF reranking in a single embedded dependency with zero operational overhead.