Skip to content

Catalog Search in the Typed Cache

How search_items and search_customers actually work, end to end — from the MCP tool surface down to SQLite triggers. Read architecture.md first for the structural overview; this recipe focuses on the behavior a contributor needs when adding a search-backed tool, debugging a "0 results" report, or understanding what changed in #472.

The 30-second summary

search_items({"query": "00.7018.581.003"})
  └─► services.typed_cache.catalog.smart_search(CachedMaterial, "00.7018.581.003")
        ├─► tokenize via re.split(r"\W+", ...)           →  ["00", "7018", "581", "003"]
        ├─► FTS5 MATCH on material_fts                   →  "00"* AND "7018"* AND "581"* AND "003"*
        │   (matches material_fts.supplier_item_codes_text)
        └─► no hits or fts5: syntax error                →  fall through to search_fuzzy
                                                            (difflib via helpers/search.score_match)

Per-entity FTS5 virtual tables (variant_fts, customer_fts, …) sit alongside the typed cache tables and are kept in lock-step via SQLite triggers. The CatalogQueries adapter at services.typed_cache.catalog exposes typed read methods returning Cached* SQLModel instances with default include_archived=False / include_deleted=False filters.

Smart_search internals

CatalogQueries.smart_search(cls, query, ...) is the entry point. Implementation lives at katana_mcp_server/src/katana_mcp/typed_cache/queries.py.

Tokenizer

_TOKEN_RE = re.compile(r"\W+")

def _tokenize_query(query: str) -> list[str]:
    return [tok for tok in _TOKEN_RE.split(query.strip()) if tok]

The legacy CatalogCache used query.split(), which only splits on whitespace. For a SKU-shaped query like 00.4021.018.003 that produced a single token. FTS5's default unicode61 tokenizer would split the indexed content on dots — so the query token never lined up with any index token, and the search returned zero results.

Splitting on \W+ mirrors how FTS5 tokenized the content, so query tokens line up with index tokens. This is the #471 / #473 fix that #472 absorbed.

MATCH expression

Each token becomes a prefix-match clause, ANDed:

tokens = ["acme", "part"]           →  "acme"* AND "part"*
tokens = ["00", "7018", "581"]      →  "00"* AND "7018"* AND "581"*

The implementation in _build_fts_match double-quote-escapes each token to protect against tokens that happen to look like FTS5 keywords (AND, NEAR, etc.). Ranking is BM25 (ORDER BY bm25(<entity>_fts)), highest signal first.

Fall-through to fuzzy

try:
    async with self._engine.session() as session:
        conn = await session.connection()
        cursor = await conn.exec_driver_sql(sql, (fts_match, limit))
        ids = [int(row[0]) for row in cursor.all()]
except OperationalError as exc:
    if not _is_fts_syntax_error(exc):
        raise  # genuine OperationalError (locked DB, missing table) — propagate
    return await _fuzzy()  # FTS5 syntax error → fuzzy
if not ids:
    return await _fuzzy()  # FTS5 returned no hits → fuzzy

Two conditions trigger the fuzzy fall-through:

  1. FTS5 returned zero hits. Common for partial matches and typos. Note that queries that tokenize to an empty list (whitespace-only or punctuation-only input) return [] directly without fuzzy — there's nothing to match against.
  2. FTS5 raised an OperationalError whose underlying orig message starts with fts5: syntax error or fts5: unknown. This is narrow on purpose — other OperationalErrors (locked DB, missing FTS table, disk I/O) propagate so operators see real failures instead of degraded-but-silent search.

A third path exists: entities without an FTS sidecar (the lookup-only types listed in the coverage table below) skip FTS entirely and go straight to fuzzy. smart_search on CachedTaxRate, for example, has no tax_rate_fts table to query — fuzzy is the only available path.

The narrowing is done via str(exc.orig).startswith(...) (with a defensive fallback to str(exc) if exc.orig happens to be None, which SQLAlchemy shouldn't produce in practice but is guarded for safety). str(exc) on its own would include SQLAlchemy's wrapper decoration (builtins.Exception) ... [SQL: ...] and make the prefix match unreliable.

Fuzzy scoring

search_fuzzy loads candidate rows and ranks them via katana_public_api_client.helpers.search.score_match — the same difflib-based scorer the legacy cache used. Field weights match the legacy convention:

Field role Weight
sku (variants) 100
Primary name 30
Secondary descriptor 20

(_FUZZY_WEIGHT_SKU / _FUZZY_WEIGHT_PRIMARY / _FUZZY_WEIGHT_SECONDARY in queries.py.) Only CachedVariant carries a SKU column; for everything else the SKU bucket stays empty and ranking falls back to the primary/secondary weighting.

The FTS5 sidecar

katana_mcp_server/src/katana_mcp/typed_cache/fts.py is the source-of-truth for the FTS lifecycle. Two responsibilities: creating the virtual tables and keeping them in sync with their content tables.

Per-entity virtual tables

Each Cached* class with FTS coverage declares a class-level __fts_columns__: ClassVar[tuple[str, ...]] injected by the generator (inject_fts_columns pass driven by CACHE_FTS_SPECS in scripts/generate_pydantic_models.py). At engine open, _create_fts_tables_ddl walks the registered subclasses and emits:

CREATE VIRTUAL TABLE IF NOT EXISTS variant_fts USING fts5(
    sku, display_name, parent_name,
    supplier_item_codes_text, internal_barcode, registered_barcode,
    content='variant', content_rowid='id'
);

content='<table>' content_rowid='id' makes this an external-content FTS5 table — it stores the inverted index but not the columns themselves. The index points back at the content table by rowid = id. That's why every FTS-enabled cache class needs an id primary key.

Coverage today (CACHE_FTS_SPECS):

Class FTS columns
CachedVariant sku, display_name, parent_name, supplier_item_codes_text, internal_barcode, registered_barcode
CachedProduct name, category_name
CachedMaterial name, category_name
CachedService name
CachedCustomer name, email, phone
CachedSupplier name, email, phone

Lookup-only types (CachedLocation, CachedTaxRate, CachedOperator, CachedFactory, CachedAdditionalCost) skip FTS — get_all + score_match fuzzy is plenty for catalogs that small.

Why SQLite triggers, not SQLAlchemy mapper events

The original Phase B spike wired after_insert / after_update / after_delete mapper events. CI was green. A reviewer caught the bug: the typed-cache sync path uses _bulk_upsert (typed_cache/sync.py), which issues INSERT ... ON CONFLICT via SQLAlchemy Core — not ORM-level session.add(). Core statements do not fire mapper events.

So every fresh MCP server startup followed by its first sync would leave the FTS inverted index empty. smart_search would silently return zero results until a manual rebuild ran. In tests this was hidden because the test fixtures used session.add() (the ORM path) and never exercised cold-start through the bulk path.

The fix is SQLite triggers on the content tables:

CREATE TRIGGER variant_ai AFTER INSERT ON variant BEGIN
    INSERT INTO variant_fts(rowid, sku, display_name, ...)
    VALUES (new.id, new.sku, new.display_name, ...);
END;

CREATE TRIGGER variant_au AFTER UPDATE ON variant BEGIN
    INSERT INTO variant_fts(variant_fts, rowid, ...) VALUES('delete', old.id, ...);
    INSERT INTO variant_fts(rowid, sku, display_name, ...)
    VALUES (new.id, new.sku, new.display_name, ...);
END;

CREATE TRIGGER variant_ad AFTER DELETE ON variant BEGIN
    INSERT INTO variant_fts(variant_fts, rowid, ...) VALUES('delete', old.id, ...);
END;

Triggers fire for every write mode SQLite supports — ORM session.add(), Core INSERT ... ON CONFLICT, raw exec_driver_sql, even a future migration helper issuing text(...) SQL. That's the SQLite-recommended pattern for external-content FTS5 and is what the typed cache uses now.

The regression test that would have caught the original bug is test_ensure_variants_synced_populates_fts_index in katana_mcp_server/tests/test_typed_cache_catalog.py — it cold-starts the cache via the bulk-upsert path and asserts that smart_search returns the seeded rows. Add an equivalent test whenever you wire a new entity into the FTS sidecar.

Variant denormalization

FTS5 can't JOIN. The variant_fts virtual table sees only the columns on CachedVariant itself — so any field a caller would reasonably search by needs to live there directly. The wire Variant schema doesn't carry several of those fields; the typed-cache sync lifts them via EntitySpec.attrs_postprocess.

The relevant hook is _variant_postprocess in typed_cache/sync.py:

def _variant_postprocess(attrs_obj: Any, cache_row: CachedVariant) -> None:
    parent = unwrap_unset(attrs_obj.product_or_material, None)
    parent_name: str | None = None
    sku = unwrap_unset(attrs_obj.sku, None)

    if parent is not None:
        cache_row.parent_archived_at = unwrap_unset(
            getattr(parent, "archived_at", None), None
        )
        parent_name = unwrap_unset(getattr(parent, "name", None), None)
        cache_row.parent_name = parent_name

    display_parts = [parent_name] if parent_name else [sku] if sku else []
    # ...append variant config_attributes...
    cache_row.display_name = " / ".join(display_parts) if display_parts else None

    codes = unwrap_unset(attrs_obj.supplier_item_codes, [])
    cache_row.supplier_item_codes_text = " ".join(codes) if codes else None

The four cache-only fields on CachedVariant:

Field Source Why it's denormalized
parent_archived_at product_or_material.archived_at (extended payload) Filter archived variants out of search without a JOIN
parent_name product_or_material.name Search hits via parent name
display_name parent.name + variant config_attributes Human-readable result rendering
supplier_item_codes_text " ".join(supplier_item_codes) FTS5 tokenizes whitespace-separated text; a list field wouldn't be indexable

An aside on depends_on: the postprocess hook gets parent data from the extended API payload (the variant fetch uses extend=[PRODUCT_OR_MATERIAL] so each variant arrives with its parent inlined), not from cached parent rows. The variant EntitySpec's depends_on=("product", "material") is therefore declarative metadata — useful for FK-join queries against the cache and for _validate_dependency_graph cycle-detection at engine open, but not load-bearing for variant denormalization itself. Cold-start ordering today is handled explicitly: ensure_variants_synced asyncio.gathers the product + material syncs before running the variant sync.

service_id — a cross-table cache-only column (post_sync + preserve-on-conflict)

CachedVariant.service_id is also cache-only, but it's populated differently from the four fields above, and the difference matters. A service variant's parent item id lives on the /services payload (ServiceVariant.service_id), not on /variants — so attrs_postprocess (which only sees the variant's own payload) can't lift it. Instead the service spec carries a post_sync hook (_backfill_service_variant_links) that, after each service sync, writes service_id onto the matching CachedVariant rows. search_items then reads it as a plain column to resolve a service's parent_id (the id modify_item / get_item need), and @cache_read(CachedVariant, CachedService) syncs services before the search so the value is in place.

Because /variants never carries service_id, the variant upsert's blanket ON CONFLICT DO UPDATE SET <every non-id column> would re-null it on every variant delta. The variant spec therefore lists it in preserve_columns_on_conflict=frozenset({"service_id"}) so the backfilled link survives. This is a general contract for any cross-table cache-only column — see Cache-only columns written cross-table need preserve-on-conflict.

Soft-state filtering

The CatalogQueries adapter defaults to filtering archived and soft-deleted rows from every read. This is a behavior change from the legacy CatalogCache, which returned every row regardless. The defaults match what the agent layer almost always wants — surface live catalog only — and the kwargs are there for the rare exception (e.g., modify_item resolving an archived row to unarchive it).

# Default: surface only live rows.
variant = await catalog.get_by_sku("PART-A1-160")

# Opt in to archived parents (variant with archived product/material).
variant = await catalog.get_by_sku("PART-A1-160", include_archived=True)

# Opt in to soft-deleted customers.
customer = await catalog.get_by_id(CachedCustomer, 42, include_deleted=True)

Filtering is implemented generically via hasattr(cls, "<col>") checks at query-build time (see _archive_col_name and _has_deleted_column in queries.py). Adding a new Cached* sibling that carries archived_at or deleted_at works without any adapter changes — the filter logic picks the column up automatically.

Per-entity soft-state surface map (from the inheritance audit in #472's plan):

Class archived_at deleted_at (own) parent_archived_at (cache-only)
CachedVariant
CachedProduct/Material
CachedService
CachedCustomer
CachedSupplier
CachedOperator
CachedAdditionalCost
CachedTaxRate
CachedLocation
CachedFactory

include_archived=True / include_deleted=True are no-ops for classes that don't carry the relevant column.

Worked examples

The tests in katana_mcp_server/tests/test_typed_cache_catalog.py pin these behaviors. Reach for them when you change anything in queries.py or fts.py.

UPC scan via registered_barcode

hits = await catalog.smart_search(CachedVariant, "710845916762")
# → [CachedVariant(sku="PART-A1-PRO-29-160", registered_barcode="710845916762", ...)]

Tokenize: ["710845916762"]. Match: "710845916762"*. Hits variant_fts via the registered_barcode column. BM25 ranks the exact-prefix match first.

Supplier item code

hits = await catalog.smart_search(CachedMaterial, "00.7018.581.003")
# → [CachedMaterial(name="Acme PG-1170 component", ...)]

Tokenize: ["00", "7018", "581", "003"]. Match: "00"* AND "7018"* AND "581"* AND "003"*. The material's supplier_item_codes_text column carries the space-joined supplier codes; all four tokens hit prefixes there.

This is the #473 repro case (originally a real supplier-item-code lookup that returned zero results from the legacy cache). It only works because the tokenizer splits on \W+; with the legacy query.split() the whole string was one unmatchable token.

Multi-token natural-language

hits = await catalog.smart_search(CachedProduct, "kitchen knife")
# → BM25-ranked products with "kitchen" and "knife" appearing in name/category_name

Tokenize: ["kitchen", "knife"]. Match: "kitchen"* AND "knife"*. Behavior unchanged from the legacy cache — multi-token AND with prefix expansion.

Syntax-error fall-through (the test-only path)

The \W+ tokenizer plus _build_fts_match's double-quote escaping protect every user-typed query from producing an invalid FTS5 MATCH expression. Reserved-word queries like "AND OR NOT" tokenize to ["AND", "OR", "NOT"] and emit the valid (if odd) expression "AND"* AND "OR"* AND "NOT"* — the words land inside the quoted string literals and don't trigger FTS5's keyword parser.

This is by design — _is_fts_syntax_error exists to handle the rare case where a future generator regression or a SQLite version bump produces a query the FTS5 parser rejects. The regression test for this path patches exec_driver_sql to raise OperationalError("fts5: syntax error near ...") directly and asserts the fall-through routes to fuzzy.

A LockedError or OperationalError("no such table: ...") would not match the fts5: prefix check and would propagate. That's the design — silent fall-through is reserved for the recoverable case.

Where to look

  • Implementationkatana_mcp_server/src/katana_mcp/typed_cache/queries.py, .../fts.py, .../sync.py.
  • Generatorscripts/generate_pydantic_models.py: CACHE_FTS_SPECS, inject_fts_columns.
  • Testskatana_mcp_server/tests/test_typed_cache_catalog.py covers the full surface (round-trip sync, FTS triggers, smart_search semantics, fall-through, soft-state filtering).
  • Architecture overviewarchitecture.md §Cache.
  • Decision recordADR-0018 for the typed-cache foundation, ADR-0018 status footer for the #472 unification.