The Context Server: Persistent Local ML Inference in Async Rust

Two previous posts covered the context engine's search internals and the chairman orchestrator that queries it. This post is about the thing that makes both of them fast: a persistent HTTP server that loads AllMiniLmL6V2 once and serves hybrid search queries in under 5 milliseconds.

The problem is cold starts. The embedding model — 22 million parameters of libtorch weights — takes 1-2 seconds to load from disk. An AI agent working on the codebase fires hundreds of context queries per session. If every query loads the model, you're spending more time on initialization than inference. The context server eliminates this: start it once, model stays hot, every query after that is just an encode call.

Startup Sequence

The server boots in four steps. Open the LanceDB table from .context_db/. Spawn a dedicated OS thread for the embedding model. Test-encode a throwaway string to confirm the model is ready. Stand up an Axum router on localhost.

context_server.rs — boot sequence
let db = connect_db(&root).await
    .expect("Failed to connect to LanceDB");
let table = db.open_table(TABLE_NAME).execute().await
    .expect("Table not found");

// Spawn dedicated model thread
let (tx, mut rx) = mpsc::channel::<EmbedRequest>(32);

std::thread::spawn(move || {
    let model = SentenceEmbeddingsBuilder::remote(
        SentenceEmbeddingsModelType::AllMiniLmL6V2
    ).create_model().expect("Failed to load embedding model");
    eprintln!("Model loaded. Ready for requests.");

    while let Some(req) = rx.blocking_recv() {
        // process Encode / BatchEncode requests
    }
});

// Wait for model to be ready
let test_handle = ModelHandle { tx: tx.clone() };
let _ = test_handle.encode("test").await;

let app = Router::new()
    .route("/browse", post(handle_browse))
    .route("/fetch", post(handle_fetch))
    .route("/rebuild", post(handle_rebuild))
    .route("/health", get(handle_health))
    .with_state(state);

The test encode is the synchronization point — main() blocks on it until the model thread has loaded and can respond. After that, the server binds to 127.0.0.1:3031 and is ready for queries. Cold to serving in under 3 seconds.
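The handshake generalizes beyond embeddings: treat the first successful round-trip through the worker as the readiness signal. A std-only sketch of the idea — names are hypothetical, a sleep stands in for the model load, and string length stands in for encode():

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Each request carries its own reply channel, like EmbedRequest does.
type Req = (String, mpsc::Sender<usize>);

// Spawn a worker that "loads" (sleeps), then answers requests forever.
fn spawn_worker() -> mpsc::Sender<Req> {
    let (tx, rx) = mpsc::channel::<Req>();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(100)); // stand-in for model load
        while let Ok((text, reply)) = rx.recv() {
            let _ = reply.send(text.len()); // stand-in for encode()
        }
    });
    tx
}

// One round-trip; blocking on the reply is the readiness barrier.
fn round_trip(tx: &mpsc::Sender<Req>, text: &str) -> usize {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send((text.to_string(), reply_tx)).unwrap();
    reply_rx.recv().unwrap()
}

fn main() {
    let tx = spawn_worker();
    // This cannot return before the worker's "load" has finished.
    let n = round_trip(&tx, "test");
    println!("worker ready, got {}", n); // prints "worker ready, got 4"
}
```

The throwaway request doubles as the barrier: no flag, no condvar, just the first reply.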

The Model Thread Pattern

This is the most interesting part of the server, and a pattern worth stealing for anyone doing local ML inference in async Rust.

The challenge: rust-bert uses libtorch under the hood. libtorch tensors are not Send or Sync — they can't cross thread boundaries safely and they can't be shared between threads. But our HTTP handlers run on tokio's async runtime, which moves tasks between worker threads freely. You can't hold a model reference in an async handler.

The naive solution — wrapping the model in Arc<Mutex<Model>> — doesn't work because the model isn't Send. Even if it were, you'd serialize all inference behind a single mutex. The correct solution: isolate the model on a dedicated OS thread and communicate via channels.

context_server.rs — EmbedRequest and ModelHandle
enum EmbedRequest {
    Encode {
        text: String,
        reply: oneshot::Sender<Vec<f32>>,
    },
    BatchEncode {
        texts: Vec<String>,
        reply: oneshot::Sender<Vec<Vec<f32>>>,
    },
}

#[derive(Clone)]
struct ModelHandle {
    tx: mpsc::Sender<EmbedRequest>,
}

impl ModelHandle {
    async fn encode(&self, text: &str) -> Vec<f32> {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx.send(EmbedRequest::Encode {
            text: text.to_string(),
            reply: reply_tx,
        }).await.expect("Model thread died");
        reply_rx.await.expect("Model thread dropped reply")
    }
}

ModelHandle is Clone — it's just an mpsc::Sender under the hood. Every Axum handler gets its own clone via Arc<AppState>. When a handler needs an embedding, it creates a oneshot channel, sends an EmbedRequest with the reply half, and awaits the response. The handler stays async the entire time. No blocking, no mutex contention.

The model thread itself is a plain std::thread::spawn — not a tokio::spawn. This is deliberate. libtorch's encode() is a blocking CPU operation that can take several milliseconds. If this ran as a tokio task, it would block one of the runtime's worker threads during inference, potentially starving other handlers. On a dedicated OS thread, blocking is fine — that's all the thread does.

context_server.rs — model thread loop
std::thread::spawn(move || {
    let model = SentenceEmbeddingsBuilder::remote(
        SentenceEmbeddingsModelType::AllMiniLmL6V2
    ).create_model().expect("Failed to load embedding model");

    while let Some(req) = rx.blocking_recv() {
        match req {
            EmbedRequest::Encode { text, reply } => {
                let embeddings = model.encode(&[&text])
                    .expect("Failed to encode");
                let _ = reply.send(
                    embeddings[0].as_slice().to_vec()
                );
            }
            EmbedRequest::BatchEncode { texts, reply } => {
                let refs: Vec<&str> = texts.iter()
                    .map(|s| s.as_str()).collect();
                let embeddings = model.encode(&refs)
                    .expect("Failed to batch encode");
                let result: Vec<Vec<f32>> = embeddings.iter()
                    .map(|e| e.as_slice().to_vec()).collect();
                let _ = reply.send(result);
            }
        }
    }
});

The channel capacity is 32. If 32 requests are already queued, send().await simply waits for a slot — backpressure that pauses Axum handlers naturally, without errors. The model processes requests sequentially, which is correct for CPU inference where parallel encode calls wouldn't benefit from batching anyway. If you had a GPU with dynamic batching, you'd collect requests from the channel for a few milliseconds before dispatching a batch. For CPU inference on a 22M parameter model, sequential is right.

The let _ = reply.send(...) pattern silently drops the result if the caller has already timed out and dropped their oneshot receiver. No panic, no error — the model just moves on to the next request. This is the correct behavior for a service where callers may have timeout budgets.
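Both properties — bounded capacity and shrugging off a vanished caller — fall out of the channel types themselves rather than any extra code. A std-only sketch, with sync_channel standing in for tokio's bounded mpsc and hypothetical helper names:

```rust
use std::sync::mpsc;

// Bounded channel: once the buffer is full, try_send fails where the
// blocking send() would instead park the sender — that is backpressure.
fn demo_backpressure() -> bool {
    let (tx, _rx) = mpsc::sync_channel::<u32>(2);
    tx.try_send(1).unwrap();
    tx.try_send(2).unwrap();
    tx.try_send(3).is_err() // buffer full: third send must wait
}

// Dropped receiver: send returns Err instead of panicking, which is
// why `let _ = reply.send(...)` can safely ignore the result.
fn demo_dropped_receiver() -> bool {
    let (reply_tx, reply_rx) = mpsc::channel::<Vec<f32>>();
    drop(reply_rx); // the caller timed out and went away
    reply_tx.send(vec![0.0; 384]).is_err()
}

fn main() {
    assert!(demo_backpressure());
    assert!(demo_dropped_receiver());
    println!("both properties hold");
}
```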

Two-Step Discovery: Browse Then Fetch

The server exposes two search endpoints that mirror how an AI agent actually retrieves context.

Browse is the discovery step. You give it a natural language query and it returns ranked section headers — document names, section titles, trigger phrases, relevance scores. No full content. This is intentional: an LLM's context window is finite, and dumping the full text of 15 matching sections would waste most of it on irrelevant paragraphs.

CLI — browse example
$ cargo run -p context-query -- browse --hybrid "model thread pattern"

Matched sections:

  context-server > Server Architecture  (score: 0.042)
    → context server startup boot, model thread pattern, ...
    cluster: Dev & Build Tools

  physics-movement > Architecture  (score: 0.031)
    → Rapier rigidbody physics, player movement, ...
    cluster: Match Simulation Core

The agent scans these results, identifies the relevant hits, then issues a fetch with specific trigger phrases to pull full content. Fetch returns the matched sections plus a graph expansion step — one hop of related context, automatically.

CLI — fetch example
$ cargo run -p context-query -- fetch --hybrid "context server startup boot, model thread pattern"

## context-server > Server Architecture

[full section content with file:line references...]

---

===  RELATED (1 hop)  ===

## search > Hybrid Search  (related)

[overview of related system...]

This two-step pattern keeps the context payload tight. Browse is cheap — a handful of short strings. Fetch is targeted — only the sections the agent actually needs, plus their immediate neighborhood in the document graph.

Hybrid Search: BM25 + Vectors

Each query runs two search paths simultaneously on the same Lance table. BM25 full-text search matches exact terms in trigger phrases — if you search for "EmbedRequest", it finds sections where that exact token appears. Vector similarity search encodes your query with the same AllMiniLmL6V2 model and finds sections whose trigger embeddings are nearby in 384-dimensional space — so "how does the model thread work" matches sections about channel-based inference even if those exact words don't appear in the triggers.

Reciprocal Rank Fusion (k=60) merges the two ranked lists. A section ranked #1 in BM25 and #3 in vector search gets a combined score of 1/61 + 1/63. A section appearing in only one list gets roughly half that score. The previous deep dive post covers the math in detail — the point here is that the server handles both retrieval paths in a single request, with the model thread doing the vector encoding inline.
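The fusion itself is only a few lines. A sketch of RRF with this post's k = 60 — a hypothetical helper with made-up document names, not the server's actual implementation:

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d),
// with 1-based ranks; k = 60 damps the advantage of top-ranked hits.
fn rrf(lists: &[Vec<&str>], k: f64) -> HashMap<String, f64> {
    let mut scores = HashMap::new();
    for list in lists {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    scores
}

fn main() {
    let bm25 = vec!["server-arch", "hybrid-search", "physics"];
    let vector = vec!["hybrid-search", "chairman", "server-arch"];
    let fused = rrf(&[bm25, vector], 60.0);
    // #1 in BM25 and #3 in vector search: 1/61 + 1/63
    let expected = 1.0 / 61.0 + 1.0 / 63.0;
    assert!((fused["server-arch"] - expected).abs() < 1e-12);
    println!("server-arch: {:.6}", fused["server-arch"]);
}
```

Sections missing from one list simply contribute no term for it, which is where the "roughly half" intuition comes from.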

Graph Expansion

When fetch collects direct hits, it also reads their related_docs and related_sections fields — edges in the document graph. For each related document that wasn't already a direct hit, the server queries the Lance table for its Overview section (section_idx = 0).

search.rs — graph expansion query
let expand_docs: Vec<String> = related_doc_names
    .difference(&hit_doc_names)
    .cloned()
    .collect();

let filter = format!(
    "doc_name IN ({}) AND section_idx = 0",
    in_list
);

table.query()
    .only_if(filter)
    .select(Select::columns(&[
        "doc_name", "section_title", "content"
    ]))
    .execute()
    .await

No separate graph database. The edges are stored as comma-separated strings in the Lance table itself. The expansion is a SQL filter on the same table — one query, no joins, no external service. Fetching only Overview sections keeps the expansion bounded: you get enough context to understand what a related system does, not its full documentation.
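The in_list in the excerpt is elided; one plausible way to assemble it from the set difference — a hypothetical helper that assumes doc names contain no quote characters:

```rust
use std::collections::BTreeSet;

// Docs referenced by graph edges but not already among the direct hits.
// Returns None when there is nothing to expand, so the caller can skip
// the query entirely.
fn expansion_filter(related: &BTreeSet<String>, hits: &BTreeSet<String>) -> Option<String> {
    let expand: Vec<String> = related
        .difference(hits)
        .map(|d| format!("'{}'", d)) // assumes no embedded quotes
        .collect();
    if expand.is_empty() {
        return None;
    }
    Some(format!(
        "doc_name IN ({}) AND section_idx = 0",
        expand.join(", ")
    ))
}

fn main() {
    let related: BTreeSet<String> =
        ["search", "context-server"].iter().map(|s| s.to_string()).collect();
    let hits: BTreeSet<String> =
        ["context-server"].iter().map(|s| s.to_string()).collect();
    let filter = expansion_filter(&related, &hits).unwrap();
    assert_eq!(filter, "doc_name IN ('search') AND section_idx = 0");
    println!("{}", filter);
}
```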

Hot Rebuild

The /rebuild endpoint re-reads all JSON context documents from docs/context/, batch-encodes their triggers and content through the model thread, recreates the Lance table, and hot-swaps it behind an RwLock. No server restart.

context_server.rs — hot rebuild
async fn handle_rebuild(State(state): State<Arc<AppState>>) -> &'static str {
    let (mut rows, doc_count) =
        claude_tools::context_doc_loader::load_section_rows(&root);

    // Batch encode triggers
    let triggers_texts: Vec<String> = rows.iter()
        .map(|r| r.triggers.replace('\n', " ")).collect();
    let triggers_embeddings =
        state.model_handle.batch_encode(triggers_texts).await;

    // Batch encode content
    let content_texts: Vec<String> = rows.iter()
        .map(|r| r.content.chars().take(2000).collect()).collect();
    let content_embeddings =
        state.model_handle.batch_encode(content_texts).await;

    // ... assign embeddings to rows ...

    let new_table =
        claude_tools::db::create_table_from_rows(&db, &rows).await
            .expect("Failed to rebuild table");

    let mut table = state.table.write().await;
    *table = new_table;
    "rebuilt"
}

The RwLock is the key. Browse and fetch handlers take read locks — they can run concurrently without blocking each other. Rebuild takes a write lock only for the final swap, which is a pointer assignment. The window where reads are blocked is microseconds. During the rest of the rebuild — reading JSON, encoding embeddings, creating the new table — the server continues serving queries from the old table.
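The lock discipline in miniature — std's RwLock standing in for tokio's, and a Vec standing in for the Lance table:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Stand-in for the LanceDB table handle.
type Table = Vec<&'static str>;

// All the expensive rebuild work happens *before* this call, with no
// lock held; the write lock covers only the assignment itself.
fn hot_swap(table: &RwLock<Table>, new_table: Table) {
    *table.write().unwrap() = new_table;
}

fn main() {
    let table = Arc::new(RwLock::new(vec!["old-doc"]));

    // Concurrent readers take read locks and never block each other.
    let reader = {
        let table = Arc::clone(&table);
        thread::spawn(move || table.read().unwrap().len())
    };
    reader.join().unwrap();

    hot_swap(&table, vec!["doc-a", "doc-b"]); // stand-in for the rebuilt table
    assert_eq!(table.read().unwrap().len(), 2);
    println!("swapped in {} docs", table.read().unwrap().len());
}
```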

Triggered by a single command: cargo run -p context-query -- rebuild. For our current corpus of ~170 documents and ~1,270 sections, a full rebuild takes a few seconds. The batch encoding dominates — the model thread processes all trigger and content texts sequentially through two batch_encode calls.

Why Local, Not an API

AllMiniLmL6V2 is a 22 million parameter model. It loads in under 2 seconds on a laptop CPU. A single encode call returns in under 5 milliseconds. For our use case — a developer tool that serves a local AI agent pipeline — this was clearly the right choice over an embedding API.

No API keys to manage. No network latency added to every query. No per-token cost that scales with usage. No external service to depend on, no rate limits to worry about. The model runs on the same machine as the agent, the database is a directory on disk, and the server is a single Rust binary.

This doesn't generalize to every embedding use case. If you need multilingual embeddings, or your corpus has millions of documents, or you're serving a distributed fleet of clients, a hosted API with a larger model makes sense. But for a local development tool querying ~1,200 sections, 22M parameters on localhost is more than enough. The entire server — model loading, HTTP routing, search, rebuild — is under 300 lines of Rust.

The Compound Effect

The context server is the memory layer for an autonomous development pipeline. The chairman orchestrator queries it to detect which codebase systems a feature touches. Research agents write new context documents and trigger rebuilds — the server picks up the new knowledge without restarting. Every document added makes the next query smarter.

It's a small piece of infrastructure — an Axum server, a model thread, a LanceDB table — but it's the piece that makes everything else fast. The chairman pattern works because context queries return in milliseconds instead of seconds. The TDD executor works because it can browse for test patterns before writing code. The whole pipeline works because there's a persistent process holding a warm model, ready to turn natural language into vectors on demand.

This is what lets us ship football simulation features faster. Not a bigger model, not more agents — just a localhost server that loads once and remembers everything.