Architecture

WARNING

Treat this page as a narrative companion for developers who enjoy reading about low-level engineering, not as operational documentation to rely on for debugging or performance tuning. If something here contradicts the code, the code wins.

System Overview

┌───────────────────────────────────────────────────┐
│                    Client                         │
├───────────────────────────────────────────────────┤
│                   HTTP (port 3000)                │
├───────────────────────────────────────────────────┤
│          Axum Router (src/routes.rs)              │
│     /collections  /search  /suggest  /items       │
├───────────────────────────────────────────────────┤
│            Store Engine (src/store.rs)            │
│  Inverted Index  ·  Tokenization  ·  ID Strategy  │
├───────────────────────────────────────────────────┤
│            fjall LSM-tree Database                │
│        Keyspaces: _collections, _index_queue,     │
│           {col}.inverted, {col}.docs              │
└───────────────────────────────────────────────────┘

The server has three layers:

  1. HTTP Layer: Axum router exposing REST endpoints.
  2. Store Engine: core logic for tokenization, inverted index management, and search/insert.
  3. Persistence Layer: fjall LSM-tree database for on-disk storage.

HTTP Layer (src/routes.rs)

An Axum Router maps endpoints to handler functions that delegate to the Store. All state is shared via Arc<Store>.

| Method | Path | Purpose |
|---|---|---|
| GET | /status | Health check |
| GET | /collections | List collections |
| POST | /collections | Create collection |
| GET | /collections/{name} | Collection metadata |
| DELETE | /collections/{name} | Delete collection |
| POST | /collections/{name}/items | Upsert document |
| DELETE | /collections/{name}/items/{id} | Delete document |
| GET | /collections/{name}/search?q=... | Search documents |
| GET | /collections/{name}/suggest?q=... | Autocomplete |
| POST | /backup/export | Export database snapshot to a file in the dumps folder |
| POST | /backup/import | Import a snapshot from the dumps folder |

Store Engine (src/store.rs)

The Store struct is the heart of Aperio. It holds:

  • db: fjall::Database — the underlying database handle.
  • config: StoreConfig — tunable parameters (shard sizes, token length, compression, index interval).
  • collections: RwLock<HashMap<String, CollectionMeta>> — in-memory registry of known collections, their ID type and searchable fields.
  • lock: Mutex<()> — serializes write operations (upsert/delete) for index consistency.
  • next_seq: AtomicU64 — monotonic sequence counter for the indexing queue.
  • background_active: AtomicBool — whether the background indexer is running.
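
To make the field list concrete, here is a minimal sketch of how these pieces could fit together using only the standard library. `StoreSketch`, the shape of `CollectionMeta`, the `IdType` enum, and every method name are assumptions made for illustration; the `db` and `config` fields are left out because they depend on fjall and on `StoreConfig`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Mutex, RwLock};

/// Hypothetical ID type tag; the real definition may differ.
#[derive(Clone, Debug, PartialEq)]
pub enum IdType {
    String,
    Number,
}

/// Hypothetical shape of the per-collection metadata.
#[derive(Clone, Debug, PartialEq)]
pub struct CollectionMeta {
    pub id_type: IdType,
    pub searchable_fields: Vec<String>,
}

/// Simplified stand-in for `Store` (no `db`, no `config`).
pub struct StoreSketch {
    collections: RwLock<HashMap<String, CollectionMeta>>,
    lock: Mutex<()>,
    next_seq: AtomicU64,
    background_active: AtomicBool,
}

impl StoreSketch {
    pub fn new() -> Self {
        StoreSketch {
            collections: RwLock::new(HashMap::new()),
            lock: Mutex::new(()),
            next_seq: AtomicU64::new(0), // starting at 0 is an assumption
            background_active: AtomicBool::new(false),
        }
    }

    /// Registers a collection; takes the registry's write lock briefly.
    pub fn create_collection(&self, name: &str, meta: CollectionMeta) {
        self.collections.write().unwrap().insert(name.to_string(), meta);
    }

    /// Looks up metadata; many readers can hold the read lock at once.
    pub fn collection(&self, name: &str) -> Option<CollectionMeta> {
        self.collections.read().unwrap().get(name).cloned()
    }

    /// Hands out a unique, increasing sequence number for the queue.
    pub fn next_seq(&self) -> u64 {
        self.next_seq.fetch_add(1, Ordering::SeqCst)
    }

    /// Runs `f` while holding the write mutex, so index updates never interleave.
    pub fn with_write_lock<T>(&self, f: impl FnOnce() -> T) -> T {
        let _guard = self.lock.lock().unwrap();
        f()
    }

    pub fn is_background_active(&self) -> bool {
        self.background_active.load(Ordering::SeqCst)
    }
}
```

The split between an `RwLock` for the read-mostly registry and a plain `Mutex<()>` for writes mirrors the field list above: lookups stay cheap, while upserts and deletes are serialized.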

Tokenization

Document content is tokenized using charabia:

content → tokenize() → filter(is_word) → lemma() → filter(min_token_length)

Tokens are deduplicated into a HashSet<String> before indexing.
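
The shape of that pipeline can be illustrated without charabia. The sketch below is a stand-in only: splitting on non-alphanumeric characters and lowercasing are simplifications that approximate `tokenize()`, `filter(is_word)` and `lemma()`, and the function name and signature are assumptions.

```rust
use std::collections::HashSet;

/// Stand-in for the charabia pipeline; NOT charabia's actual behavior.
pub fn tokenize(content: &str, min_token_length: usize) -> HashSet<String> {
    content
        // Rough substitute for tokenize() + filter(is_word).
        .split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        // Rough substitute for lemma(): lowercase only.
        .map(|word| word.to_lowercase())
        // filter(min_token_length), counted in characters rather than bytes.
        .filter(|word| word.chars().count() >= min_token_length)
        // Collecting into a HashSet deduplicates the tokens.
        .collect()
}
```

With a minimum length of 3, the input "The quick fox. The FOX!" yields the three tokens "the", "quick" and "fox", since repeated and differently cased words collapse into one entry.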

Inverted Index

Each collection has an inverted index stored in a dedicated fjall keyspace ({name}.inverted). Each unique token (word) maps to a posting list of document IDs.

A word marker is an entry whose key is the word and whose value is empty (word → empty bytes). It signals that the word exists in the index, enabling fast prefix scans for autocomplete.
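
A sorted map is enough to show why markers make prefix scans cheap. In this sketch a `BTreeMap` stands in for the `{name}.inverted` keyspace and holds only marker entries; the real key layout, including how shard keys are distinguished from markers, is not described here, and `mark_word` and `suggest` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Records that `word` exists: the key is the word, the value is empty.
pub fn mark_word(index: &mut BTreeMap<String, Vec<u8>>, word: &str) {
    index.insert(word.to_string(), Vec::new());
}

/// Returns up to `limit` indexed words that start with `prefix`.
pub fn suggest(index: &BTreeMap<String, Vec<u8>>, prefix: &str, limit: usize) -> Vec<String> {
    index
        // Seek directly to the first key >= prefix.
        .range(prefix.to_string()..)
        // Keys are sorted, so matches are contiguous; stop at the first miss.
        .take_while(|(key, _)| key.starts_with(prefix))
        // Only marker entries (empty values) count as words.
        .filter(|(_, value)| value.is_empty())
        .map(|(key, _)| key.clone())
        .take(limit)
        .collect()
}
```

Because the keys are ordered, the scan touches only the entries that share the prefix instead of walking the whole index.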

Two ID Strategies

Collections are created with an id_type that determines the posting list format:

| id_type | Storage format | Data structure |
|---|---|---|
| string | rkyv-archived shards | PostingShard { first, last, ids: Vec<String> } |
| number | Serialized bitmap shards | RoaringTreemap per shard |

String IDs

Posting lists are split into shards of configurable max_shard_size (default 1000). Each shard stores sorted Vec<String> archived via rkyv. A binary search across shards locates the correct shard for insertion.

A Vec<u64> would be faster for posting-list operations, but u64 can't represent arbitrary string IDs like UUIDs, so Vec<String> is used as the general-purpose format.
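
The shard lookup described above can be sketched in memory. This version omits rkyv archiving, and the split-in-half policy for a full shard is an assumption rather than the documented behavior; `locate_shard` and `insert_id` are hypothetical names.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct PostingShard {
    pub first: String,
    pub last: String,
    pub ids: Vec<String>,
}

/// Index of the shard that should hold `id`: the first shard whose `last`
/// is >= `id`, or the final shard when `id` sorts after everything.
pub fn locate_shard(shards: &[PostingShard], id: &str) -> usize {
    let idx = shards.partition_point(|s| s.last.as_str() < id);
    idx.min(shards.len().saturating_sub(1))
}

/// Inserts `id` in sorted position, splitting the shard if it overflows.
pub fn insert_id(shards: &mut Vec<PostingShard>, id: &str, max_shard_size: usize) {
    let max_shard_size = max_shard_size.max(1); // guard against a zero limit
    if shards.is_empty() {
        shards.push(PostingShard {
            first: id.to_string(),
            last: id.to_string(),
            ids: vec![id.to_string()],
        });
        return;
    }
    let i = locate_shard(shards, id);
    let shard = &mut shards[i];
    let pos = match shard.ids.binary_search_by(|x| x.as_str().cmp(id)) {
        Ok(_) => return, // already present, nothing to do
        Err(pos) => pos,
    };
    shard.ids.insert(pos, id.to_string());

    // Assumed policy: split an overfull shard into two halves.
    let overflow = if shard.ids.len() > max_shard_size {
        let mid = shard.ids.len() / 2;
        Some(shard.ids.split_off(mid))
    } else {
        None
    };
    shard.first = shard.ids[0].clone();
    shard.last = shard.ids[shard.ids.len() - 1].clone();

    if let Some(tail) = overflow {
        let new_shard = PostingShard {
            first: tail[0].clone(),
            last: tail[tail.len() - 1].clone(),
            ids: tail,
        };
        shards.insert(i + 1, new_shard);
    }
}
```

Keeping `first` and `last` on each shard lets the lookup compare against shard boundaries only, so the ID vectors of other shards never need to be touched.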

Number IDs

Posting lists use RoaringTreemap bitmaps, sharded at max_roaring_shard_size (default 100,000). Bitmaps offer compact storage and fast bitwise intersection for multi-term queries.

Search Execution

  1. Tokenize the query string.
  2. List shard indices for each token in parallel (via std::thread::scope).
  3. Sort tokens by shard count (rarest-first optimization).
  4. Load posting lists: for string IDs, combine shards with an iterative sorted merge; for number IDs, union the shard bitmaps of each word, then intersect the results across words.
  5. Apply sort and pagination: sort by ID ascending or descending, apply optional after cursor, cap at take.
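
Steps 2 and 3 can be sketched with `std::thread::scope`. Here a `HashMap` of token to shard count stands in for listing shard indices in the keyspace, and returning `None` when a token has no shards is an assumption about how an empty match might be short-circuited; `order_rarest_first` is a hypothetical name.

```rust
use std::collections::HashMap;
use std::thread;

/// Counts shards per token in parallel, then orders tokens rarest-first.
/// Returns `None` if any token has no shards (an AND query cannot match).
pub fn order_rarest_first(
    shard_counts: &HashMap<String, usize>,
    tokens: &[String],
) -> Option<Vec<(String, usize)>> {
    // Scoped threads may borrow `shard_counts` and `tokens` directly,
    // because the scope joins every thread before returning.
    let mut counted: Vec<(String, usize)> = thread::scope(|s| {
        let handles: Vec<_> = tokens
            .iter()
            .map(|t| s.spawn(move || (t.clone(), shard_counts.get(t).copied().unwrap_or(0))))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("lookup thread panicked"))
            .collect()
    });

    if counted.iter().any(|(_, n)| *n == 0) {
        return None;
    }
    // Stable sort: ties keep their order from the query.
    counted.sort_by_key(|(_, n)| *n);
    Some(counted)
}
```

Starting from the token with the fewest shards keeps the candidate set small, so every later membership check or intersection works on less data.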

Search: String IDs

For string-ID collections, each shard is an rkyv-archived PostingShard. The engine loads all shards for the rarest word, then iterates through its sorted IDs, checking membership in other words' shards via binary search.
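
A minimal in-memory sketch of that membership check follows. It works on plain `Vec<String>` shards instead of rkyv-archived data, and the helper names (`from_sorted`, `contains`, `intersect`) are assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct PostingShard {
    pub first: String,
    pub last: String,
    pub ids: Vec<String>,
}

impl PostingShard {
    /// Builds a shard from a non-empty, already sorted list of IDs.
    pub fn from_sorted(ids: Vec<String>) -> Self {
        PostingShard {
            first: ids[0].clone(),
            last: ids[ids.len() - 1].clone(),
            ids,
        }
    }
}

/// True if `id` appears in one of the (sorted, non-overlapping) shards.
pub fn contains(shards: &[PostingShard], id: &str) -> bool {
    // First binary search: find the only shard whose range can hold `id`.
    let i = shards.partition_point(|s| s.last.as_str() < id);
    match shards.get(i) {
        // Second binary search: look inside that shard.
        Some(s) if s.first.as_str() <= id => {
            s.ids.binary_search_by(|x| x.as_str().cmp(id)).is_ok()
        }
        _ => false,
    }
}

/// Walks the rarest word's IDs in order and keeps those present in every other word.
pub fn intersect(rarest: &[PostingShard], others: &[Vec<PostingShard>]) -> Vec<String> {
    rarest
        .iter()
        .flat_map(|s| s.ids.iter())
        .filter(|id| others.iter().all(|word| contains(word, id.as_str())))
        .cloned()
        .collect()
}
```

Because the rarest word's IDs are visited in sorted order, the result comes out sorted without a separate sort pass.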

Search: Number IDs

For number-ID collections, each shard is a RoaringTreemap. Per word, all shards are merged with bitwise OR. Words are then intersected with bitwise AND. The resulting bitmap is iterated in ascending or descending order.
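
The OR-then-AND flow can be shown with the standard library alone. In this sketch `BTreeSet<u64>` is a stand-in for `RoaringTreemap`, so it illustrates the set algebra but not the compression or speed of real bitmaps. The exclusive `after` cursor semantics and the function names are assumptions.

```rust
use std::collections::BTreeSet;

/// Bitwise-OR equivalent: merge all shards of one word.
pub fn union_shards(shards: &[BTreeSet<u64>]) -> BTreeSet<u64> {
    shards.iter().flat_map(|s| s.iter().copied()).collect()
}

/// Bitwise-AND equivalent: keep IDs present in every word.
pub fn intersect_words(words: &[Vec<BTreeSet<u64>>]) -> BTreeSet<u64> {
    let mut merged = words.iter().map(|shards| union_shards(shards));
    let first = match merged.next() {
        Some(set) => set,
        None => return BTreeSet::new(),
    };
    merged.fold(first, |acc, next| acc.intersection(&next).copied().collect())
}

/// Iterates the result in the requested order, skipping up to and
/// including the `after` cursor, and caps the page at `take` IDs.
pub fn page(ids: &BTreeSet<u64>, descending: bool, after: Option<u64>, take: usize) -> Vec<u64> {
    if descending {
        ids.iter()
            .rev()
            .copied()
            .filter(|id| after.map_or(true, |a| *id < a))
            .take(take)
            .collect()
    } else {
        ids.iter()
            .copied()
            .filter(|id| after.map_or(true, |a| *id > a))
            .take(take)
            .collect()
    }
}
```

With the roaring crate, the same structure would use the bitmap union and intersection operators in place of the set iterators.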

Background Indexing (spawn_background)

When the background indexer is active, upsert() writes to a FIFO queue (_index_queue keyspace) instead of directly updating the index. A tokio::spawn task polls the queue at index_interval (default 900ms) and calls process_pending_queue() to drain entries through upsert_internal().

This batches write operations and reduces lock contention. When the background indexer is not active (e.g., in tests), upsert() calls upsert_internal() synchronously.
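
The queue mechanics can be sketched as follows. A `BTreeMap` stands in for the `_index_queue` keyspace, and keying entries by the big-endian bytes of the sequence number is an assumption; it is one common way to make byte-ordered keys iterate in insertion order. The tokio polling loop is omitted, and `QueueSketch` with its methods is a hypothetical name.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

pub struct QueueSketch {
    next_seq: AtomicU64,
    // Stand-in for the `_index_queue` keyspace: big-endian seq -> payload.
    pending: Mutex<BTreeMap<[u8; 8], String>>,
}

impl QueueSketch {
    pub fn new() -> Self {
        QueueSketch {
            next_seq: AtomicU64::new(0),
            pending: Mutex::new(BTreeMap::new()),
        }
    }

    /// Appends a document to the queue and returns its sequence number.
    pub fn enqueue(&self, doc: &str) -> u64 {
        let seq = self.next_seq.fetch_add(1, Ordering::SeqCst);
        // Big-endian bytes sort the same way as the numbers themselves.
        self.pending
            .lock()
            .unwrap()
            .insert(seq.to_be_bytes(), doc.to_string());
        seq
    }

    /// Drains everything currently queued, oldest first, and returns the count.
    /// `apply` plays the role of `upsert_internal()`.
    pub fn process_pending_queue(&self, mut apply: impl FnMut(&str)) -> usize {
        // Take the whole batch while holding the lock only briefly.
        let batch = std::mem::take(&mut *self.pending.lock().unwrap());
        for doc in batch.values() {
            apply(doc);
        }
        batch.len()
    }
}
```

A background task would call `process_pending_queue` once per `index_interval`; entries enqueued during a drain simply wait for the next tick.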

Persistence Layer (fjall)

fjall is an embedded LSM-tree storage engine (a RocksDB/Sled alternative). Aperio uses these fjall keyspaces:

| Keyspace | Purpose |
|---|---|
| _collections | Collection name → CollectionMeta (ID type + searchable fields) |
| _index_queue | Pending index operations (background indexing) |
| {name}.inverted | Inverted index per collection (word → posting lists) |
| {name}.docs | Full JSON documents per collection (id → JSON bytes) |

Configurable fjall options exposed via StoreConfig:

  • write_buffer_size: memtable size.
  • compression: "none" or "lz4" for data block compression.
  • block_cache_size: global block cache for the database.

Configuration (src/config.rs)

Aperio reads an optional TOML config file (CONFIG_FILE env var). Parsing is lenient: if the file cannot be parsed, Aperio logs a warning and falls back to defaults. The AppConfig struct maps one-to-one with StoreConfig fields plus server-level options (block_cache_size, maintenance_threads, log_level, dumps_folder).

The dumps_folder config option sets the directory for backup snapshots. It defaults to None (unset) — if missing, POST /backup/export and POST /backup/import return 400 Bad Request. This prevents accidental file writes when the operator hasn't explicitly configured a dump location.
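
The guard described above amounts to refusing to build a path when no folder is configured. This is a minimal sketch of that idea; `resolve_dump_path`, its `file_name` parameter, the error message, and the one-variant `AppError` are all assumptions for illustration.

```rust
use std::path::{Path, PathBuf};

/// Reduced stand-in for the real error type.
#[derive(Debug, PartialEq)]
pub enum AppError {
    BadRequest(String),
}

/// Resolves a snapshot path, or fails with a 400-style error when
/// `dumps_folder` is unset.
pub fn resolve_dump_path(
    dumps_folder: Option<&Path>,
    file_name: &str,
) -> Result<PathBuf, AppError> {
    // No configured folder means no file access at all.
    let folder = dumps_folder
        .ok_or_else(|| AppError::BadRequest("dumps_folder is not configured".to_string()))?;
    Ok(folder.join(file_name))
}
```

Modeling the option as `Option<&Path>` forces every caller through the check, so an export handler cannot reach the filesystem without an explicit folder.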

Error Handling (src/error.rs)

All operations return Result<T, AppError>, an enum that maps to appropriate HTTP status codes:

| Error variant | HTTP status |
|---|---|
| NotFound | 404 |
| BadRequest | 400 |
| Internal | 500 |

AppError implements Axum's IntoResponse trait, which renders errors as JSON: {"error": "message"}.
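
The variant-to-status mapping can be sketched without Axum. Here the `String` payload on each variant, the `status_code` and `to_json` helpers, and the hand-rolled escaping are assumptions; real code would more likely build the body with serde_json inside the `IntoResponse` impl.

```rust
#[derive(Debug)]
pub enum AppError {
    NotFound(String),
    BadRequest(String),
    Internal(String),
}

impl AppError {
    /// HTTP status code for each variant, as listed in the table.
    pub fn status_code(&self) -> u16 {
        match self {
            AppError::NotFound(_) => 404,
            AppError::BadRequest(_) => 400,
            AppError::Internal(_) => 500,
        }
    }

    /// Renders the body in the `{"error": "message"}` shape.
    /// Escapes only backslashes and double quotes; control characters
    /// are not handled in this sketch.
    pub fn to_json(&self) -> String {
        let msg = match self {
            AppError::NotFound(m) | AppError::BadRequest(m) | AppError::Internal(m) => m,
        };
        let escaped = msg.replace('\\', "\\\\").replace('"', "\\\"");
        format!("{{\"error\": \"{}\"}}", escaped)
    }
}
```

Keeping the mapping in one place means handlers can return `Result<T, AppError>` and use `?` without deciding on status codes themselves.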

Data Flow: Document Insertion and Search

Client → POST /collections/{name}/items
  → routes::upsert_item()
    → store.upsert(name, doc)
      → [background active?]
        → Yes: write to _index_queue → return
        → No:  lock() → upsert_internal()
          → extract `id` from JSON doc
          → extract searchable field values from JSON doc
          → tokenize combined searchable content (charabia)
          → load old JSON from {name}.docs
          → compute old tokens from old searchable fields
          → remove stale posting list entries
          → add/update posting list entries
          → store full JSON doc in {name}.docs
          → unlock()
Client → GET /collections/{name}/search?q=...
  → routes::search()
    → store.search(name, query, sort, take, after)
      → validate collection exists
      → tokenize query
      → parallel: list shard indices per word
      → sort by rarest word first
      → parallel: load posting lists
      → [string IDs]: sorted merge + membership check
      → [number IDs]: bitmap union + intersection
      → apply after-cursor, sort, limit
      → look up full JSON docs from {name}.docs
      → return Vec<serde_json::Value>