Documentation

Quick Start

Prerequisites

  • Rust 1.70+ (install via rustup)

Build and Run

# Clone the repository
git clone https://github.com/overclockdb/overclockdb.git
cd overclockdb

# Build in release mode
cargo build --release

# Run the server
cargo run --release

The server starts on http://localhost:8190 by default.

Environment Variables

Variable                     Default   Description
OVERCLOCKDB_PORT             8190      HTTP server port
OVERCLOCKDB_DATA_DIR         ./data    Directory for WAL and snapshots
OVERCLOCKDB_PERSISTENCE      true      Enable/disable persistence
OVERCLOCKDB_BODY_LIMIT_MB    50        Max request body size in MB (for batch imports)
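
A launcher or test script may want to know which settings the server will end up with. The sketch below resolves the documented variables against their defaults. The function name `resolve_settings`, the "true"/"false" spelling of the persistence flag, and the 1 MB = 1024 * 1024 bytes conversion are assumptions for illustration, not OverclockDB's actual parsing rules.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "OVERCLOCKDB_PORT": "8190",
    "OVERCLOCKDB_DATA_DIR": "./data",
    "OVERCLOCKDB_PERSISTENCE": "true",
    "OVERCLOCKDB_BODY_LIMIT_MB": "50",
}


def resolve_settings(env=None):
    """Return the effective settings: documented defaults overridden by env."""
    env = os.environ if env is None else env
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        "port": int(raw["OVERCLOCKDB_PORT"]),
        "data_dir": raw["OVERCLOCKDB_DATA_DIR"],
        # Assumption: the flag is spelled "true"/"false", case-insensitive.
        "persistence": raw["OVERCLOCKDB_PERSISTENCE"].strip().lower() == "true",
        # Assumption: 1 MB is treated as 1024 * 1024 bytes.
        "body_limit_bytes": int(raw["OVERCLOCKDB_BODY_LIMIT_MB"]) * 1024 * 1024,
    }


settings = resolve_settings({"OVERCLOCKDB_PORT": "9000"})
base_url = f"http://localhost:{settings['port']}"
```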

Health Check

# Liveness check
GET /health

# Readiness check
GET /ready

Collections

# Create a collection
POST /api/v1/collections
Content-Type: application/json

{
  "name": "products",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "description", "type": "string"},
    {"name": "price", "type": "float", "sort": true},
    {"name": "category", "type": "string", "facet": true}
  ],
  "enable_stemming": true,
  "stem_language": "english",
  "enable_stop_words": true,
  "stop_words_language": "english"
}

# List all collections
GET /api/v1/collections

# Get collection info
GET /api/v1/collections/:name

# Delete a collection
DELETE /api/v1/collections/:name
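
For scripted setups, the create-collection call above can be assembled with Python's standard library. This is a minimal sketch: `create_collection_request` is a hypothetical helper, the base URL is the default from Quick Start, and the request is only built here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8190"  # default address from Quick Start


def create_collection_request(name, fields, **options):
    """Build (but do not send) the POST /api/v1/collections request."""
    body = {"name": name, "fields": fields, **options}
    return Request(
        f"{BASE_URL}/api/v1/collections",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = create_collection_request(
    "products",
    [
        {"name": "title", "type": "string"},
        {"name": "price", "type": "float", "sort": True},
    ],
    enable_stemming=True,
    stem_language="english",
)
```

With a server running, `urllib.request.urlopen(request)` would send it.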

Collection Management

Rename Collection

PUT /api/v1/collections/:name/rename
Content-Type: application/json

{
  "new_name": "products_v2"
}

Update Schema

Modify field options (index, facet, sort, optional) on existing fields:

PUT /api/v1/collections/:name/schema
Content-Type: application/json

{
  "field_modifications": [
    {"name": "price", "sort": true, "facet": true},
    {"name": "description", "index": false}
  ]
}

Response:
{
  "name": "products",
  "documents_reindexed": 5000,
  "message": "Schema updated and 5000 documents reindexed"
}

Reshard Collection

Redistribute documents across a new number of shards:

POST /api/v1/collections/:name/reshard
Content-Type: application/json

{
  "num_shards": 8
}

Note: Resharding requires exclusive access and temporarily blocks queries while documents are redistributed.

Documents

# Create/update a document
POST /api/v1/collections/:name/docs
Content-Type: application/json

{
  "id": "prod_1",
  "title": "Wireless Headphones",
  "description": "Premium noise-canceling headphones",
  "price": 299.99,
  "category": "electronics"
}

# Batch import documents
POST /api/v1/collections/:name/docs/batch
Content-Type: application/json

{
  "documents": [
    {"id": "1", "title": "Product 1", "price": 99.99},
    {"id": "2", "title": "Product 2", "price": 149.99}
  ]
}

# Batch import response
{
  "imported": 995,
  "errors": [
    {"index": 42, "error": "Missing required field 'title'"},
    {"index": 108, "error": "Invalid price value: expected float"}
  ]
}

# Get a document by ID
GET /api/v1/collections/:name/docs/:id

# Replace a document
PUT /api/v1/collections/:name/docs/:id

# Delete a document
DELETE /api/v1/collections/:name/docs/:id
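
A batch import can partially succeed, so a client usually wants to pair each reported error with the document that caused it. The sketch below does that for the response shape shown above. It assumes that "index" is the zero-based position in the submitted "documents" array, which the source does not state explicitly, and `failed_documents` is a hypothetical helper name.

```python
def failed_documents(documents, response):
    """Pair each reported error with the submitted document at that index."""
    failures = []
    for entry in response.get("errors", []):
        index = entry["index"]
        # Assumption: "index" is a zero-based position in the submitted array.
        if 0 <= index < len(documents):
            failures.append((documents[index], entry["error"]))
    return failures


submitted = [
    {"id": "1", "title": "Product 1", "price": 99.99},
    {"id": "2", "price": 149.99},
]
response = {
    "imported": 1,
    "errors": [{"index": 1, "error": "Missing required field 'title'"}],
}
to_fix = failed_documents(submitted, response)
```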

Suggestions API

Autocomplete and query suggestions based on indexed terms.

Query Suggestions

Get term suggestions based on a prefix:

GET /api/v1/collections/products/suggest?prefix=lap&limit=5

Response:
{
  "suggestions": [
    {"term": "laptop", "score": 1500},
    {"term": "laptops", "score": 850},
    {"term": "laptop-case", "score": 120}
  ],
  "took_ms": 2
}

Parameters

Parameter   Type     Description
prefix      string   The prefix to match (required)
limit       number   Maximum suggestions to return (default: 10)

Facet Suggestions

Get suggestions for facet values:

GET /api/v1/collections/products/suggest-facets?prefix=Elec&facet=category&limit=5

Response:
{
  "suggestions": [
    {"value": "Electronics", "count": 450},
    {"value": "Electronics/Computers", "count": 200},
    {"value": "Electronics/Phones", "count": 180}
  ]
}
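
Prefixes typed by users can contain spaces or other characters that need URL encoding. The sketch below builds both suggestion URLs shown above with `urllib.parse`. The helper name `suggest_url` is hypothetical.

```python
from urllib.parse import quote, urlencode


def suggest_url(collection, prefix, limit=10, facet=None):
    """Build the path and query string for a suggest or suggest-facets call."""
    endpoint = "suggest-facets" if facet is not None else "suggest"
    params = {"prefix": prefix}
    if facet is not None:
        params["facet"] = facet
    params["limit"] = limit
    collection_part = quote(collection, safe="")
    return f"/api/v1/collections/{collection_part}/{endpoint}?{urlencode(params)}"


term_url = suggest_url("products", "lap", limit=5)
facet_url = suggest_url("products", "Elec", limit=5, facet="category")
```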

Field Types

Type         Description
string       Text field (indexed by default)
string[]     Array of strings (filterable, facetable)
int32        32-bit integer
int32[]      Array of 32-bit integers (filterable)
int64        64-bit integer
int64[]      Array of 64-bit integers (filterable)
float        64-bit floating point
float[]      Array of 64-bit floats (filterable)
bool         Boolean
hierarchy    Hierarchical category with "/" separator
attributes   Dynamic key-value pairs for flexible faceting

Note: Numeric array filters use ANY semantics: a document matches if any element in the array satisfies the filter condition.
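
The ANY semantics for numeric arrays can be illustrated with a short sketch. It models the rule stated in the note and is not the engine's implementation. Only operators that appear elsewhere in this document are included, and `array_matches` is a hypothetical name.

```python
import operator

# Operators that appear in this document's filter examples.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
}


def array_matches(values, op, operand):
    """True if ANY element of the array satisfies the condition."""
    compare = OPERATORS[op]
    return any(compare(value, operand) for value in values)


sizes = [5, 20, 50]
hit = array_matches(sizes, ">", 40)    # 50 satisfies the condition
miss = array_matches(sizes, ">", 100)  # no element satisfies it
```

One consequence of this rule is that an empty array never matches.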

Field Options

Option     Default   Description
index      true      Enable full-text search
facet      false     Enable facet counting
sort       false     Enable sorting
optional   false     Allow missing values
merge      false     Enable merge index for Collection Merge and Aggregation queries

Collection Options

Option                Default     Description
enable_stemming       false       Enable word stemming
stem_language         "english"   Stemming language (see Supported Languages)
enable_stop_words     false       Filter common words
stop_words_language   "english"   Stop words language (see Supported Languages)
enable_vectors        false       Enable semantic/vector search
vector_fields         all text    Fields to generate embeddings from
num_shards            null        Number of shards for parallel processing

Filtering

field:=value       # Equals
field:!=value      # Not equals
field:>value       # Greater than
field:>=value      # Greater than or equal
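field:<value       # Less than (as in "final_price:<1400")
field:^value       # Hierarchy prefix match (see Hierarchical Categories)
field:=100 AND category:^Electronics   # Combine conditions with AND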
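
Filter strings are plain text, so a small builder can keep them consistent in client code. This is a minimal sketch: `clause` and `all_of` are hypothetical helpers, only operators that appear in this document are accepted, and values are inserted verbatim because the source does not describe quoting or escaping rules.

```python
# Operators that appear in this document's filter examples.
KNOWN_OPERATORS = {"=", "!=", ">", ">=", "<", "^"}


def clause(field, op, value):
    """Format one field:OPvalue condition."""
    if op not in KNOWN_OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    return f"{field}:{op}{value}"


def all_of(*clauses):
    """Join conditions with AND."""
    return " AND ".join(clauses)


filter_string = all_of(
    clause("price", ">=", 100),
    clause("category", "^", "Electronics"),
)
```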

Typo Tolerance

OverclockDB supports typo-tolerant search using the SymSpell algorithm:

POST /api/v1/collections/products/search
{
  "q": "laptp",
  "typo_tolerance": 2
}

With typo_tolerance: 2, the query "laptp" will match documents containing "laptop".

Value       Description
0 or null   Disabled (exact matching only)
1           Allow 1 character difference
2           Allow up to 2 character differences
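
The idea behind SymSpell is symmetric deletion: index every term under all strings reachable by deleting a few characters, do the same for the query, and verify the candidates that collide. The sketch below is a simplified illustration of that idea, not OverclockDB's implementation. It uses plain Levenshtein distance for verification (an assumption; SymSpell implementations often use a Damerau variant), and all function names are hypothetical.

```python
from itertools import combinations


def deletes(word, max_distance):
    """All strings obtained by deleting up to max_distance characters."""
    variants = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def build_index(terms, max_distance):
    """Map every delete-variant to the terms that produce it."""
    index = {}
    for term in terms:
        for variant in deletes(term, max_distance):
            index.setdefault(variant, set()).add(term)
    return index


def lookup(query, index, max_distance):
    """Terms within max_distance edits of the query."""
    candidates = set()
    for variant in deletes(query, max_distance):
        candidates |= index.get(variant, set())
    return sorted(t for t in candidates if edit_distance(query, t) <= max_distance)


index = build_index(["laptop", "tablet", "desktop"], max_distance=2)
matches = lookup("laptp", index, max_distance=2)
```

Here "laptp" is one insertion away from "laptop", so it is found with a tolerance of 1 or 2 but not with 0.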

Stemming

Stemming reduces words to their root form for better recall:

  • "running", "runs" → "run"
  • "computers", "computing" → "comput"
  • "лаптопи" → "лаптоп" (Bulgarian)

POST /api/v1/collections
{
  "name": "articles",
  "fields": [{"name": "content", "type": "string"}],
  "enable_stemming": true,
  "stem_language": "english"
}

Supported Languages (19)

Language     API Value      Stemmer
Arabic       "arabic"       Snowball
Bulgarian    "bulgarian"    BulStem (128K rules)
Danish       "danish"       Snowball
Dutch        "dutch"        Snowball
English      "english"      Snowball (default)
Finnish      "finnish"      Snowball
French       "french"       Snowball
German       "german"       Snowball
Greek        "greek"        Snowball
Hungarian    "hungarian"    Snowball
Italian      "italian"      Snowball
Norwegian    "norwegian"    Snowball
Portuguese   "portuguese"   Snowball
Romanian     "romanian"     Snowball
Russian      "russian"      Snowball
Spanish      "spanish"      Snowball
Swedish      "swedish"      Snowball
Tamil        "tamil"        Snowball
Turkish      "turkish"      Snowball

Note: Bulgarian uses the custom BulStem algorithm with ~128,000 stemming rules, lazy-loaded for optimal performance.

Stop Words

Stop words are common words like "the", "a", "is" that are filtered out:

POST /api/v1/collections
{
  "name": "articles",
  "fields": [{"name": "content", "type": "string"}],
  "enable_stop_words": true,
  "stop_words_language": "english"
}

Supported Languages (19)

Stop words are available for all 19 supported languages. Use "none" to disable filtering.

Language     API Value      Word Count
Arabic       "arabic"       ~120 words
Bulgarian    "bulgarian"    ~260 words
Danish       "danish"       ~100 words
Dutch        "dutch"        ~100 words
English      "english"      ~120 words
Finnish      "finnish"      ~230 words
French       "french"       ~160 words
German       "german"       ~230 words
Greek        "greek"        ~75 words
Hungarian    "hungarian"    ~200 words
Italian      "italian"      ~280 words
Norwegian    "norwegian"    ~175 words
Portuguese   "portuguese"   ~200 words
Romanian     "romanian"     ~230 words
Russian      "russian"      ~70 words
Spanish      "spanish"      ~350 words
Swedish      "swedish"      ~115 words
Tamil        "tamil"        ~100 words
Turkish      "turkish"      ~115 words
None         "none"         Disabled

Note: Stop word lists sourced from Alir3z4/stop-words (CC-BY-4.0).

Hierarchical Categories

The hierarchy field type enables tree-structured categories with automatic ancestor indexing.

Schema Definition

POST /api/v1/collections
{
  "name": "products",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "category", "type": "hierarchy", "facet": true},
    {"name": "price", "type": "float", "sort": true}
  ]
}

Document Examples

// Single category
{"id": "laptop-1", "title": "MacBook Pro", "category": "Electronics/Computers/Laptops"}

// Multiple categories
{"id": "organizer-1", "title": "Desk Organizer", "category": ["Office/Supplies", "Home/Storage"]}

Searching Hierarchies

# All electronics (matches Laptops, Desktops, Phones, etc.)
{"q": "*", "filter": "category:^Electronics", "facets": ["category"]}

# Only computers
{"q": "*", "filter": "category:^Electronics/Computers"}

# Multiple hierarchies (OR)
{"q": "*", "filter": "category:^[Electronics,Clothing]"}
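
"Automatic ancestor indexing" means a document filed under a deep path is also findable under every prefix of that path. The sketch below models this with two hypothetical helpers, `ancestor_paths` and `matches_prefix`. It illustrates the behaviour described above and is not the engine's index structure.

```python
def ancestor_paths(path, separator="/"):
    """Expand a path into itself plus all of its ancestors."""
    parts = path.split(separator)
    return [separator.join(parts[: i + 1]) for i in range(len(parts))]


def matches_prefix(category, prefix):
    """Model of category:^prefix; category is one path or a list of paths."""
    paths = category if isinstance(category, list) else [category]
    return any(prefix in ancestor_paths(path) for path in paths)


expanded = ancestor_paths("Electronics/Computers/Laptops")
laptop_in_electronics = matches_prefix("Electronics/Computers/Laptops", "Electronics")
organizer_in_home = matches_prefix(["Office/Supplies", "Home/Storage"], "Home")
```

Matching on whole path segments rather than raw string prefixes keeps a category such as "Electronics2" from matching "category:^Electronics".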

Drill-Down Navigation

# Get children of Electronics
{"q": "*", "facets": ["category"], "hierarchy_parent": "Electronics"}

Response includes hierarchy_facets:

{
  "hierarchy_facets": {
    "category": [
      {"path": "Electronics/Computers", "name": "Computers", "count": 60, "depth": 1, "has_children": true},
      {"path": "Electronics/Phones", "name": "Phones", "count": 40, "depth": 1, "has_children": false}
    ]
  }
}
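
The drill-down response can be understood as grouping every path under the parent by its next segment. The sketch below reproduces the entry shape shown above from a list of document paths. `child_facets` is a hypothetical helper, and it assumes that "depth" is the zero-based level of the child and that each path counts as one document.

```python
from collections import Counter


def child_facets(paths, parent, separator="/"):
    """Group paths below `parent` by their next segment."""
    prefix = parent + separator
    depth = parent.count(separator) + 1  # zero-based level of direct children
    counts = Counter()
    with_children = set()
    for path in paths:
        if not path.startswith(prefix):
            continue
        parts = path.split(separator)
        child = separator.join(parts[: depth + 1])
        counts[child] += 1
        if len(parts) > depth + 1:
            with_children.add(child)
    return [
        {
            "path": child,
            "name": child.rsplit(separator, 1)[-1],
            "count": count,
            "depth": depth,
            "has_children": child in with_children,
        }
        for child, count in sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    ]


facets = child_facets(
    [
        "Electronics/Computers/Laptops",
        "Electronics/Computers/Desktops",
        "Electronics/Phones",
        "Clothing/Shirts",
    ],
    parent="Electronics",
)
```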

Internationalization (i18n)

OverclockDB supports translated facet labels for multilingual product catalogs.

Set Translations

PUT /api/v1/collections/products/translations
{
  "field": "category",
  "translations": [
    {
      "value": "Electronics",
      "labels": {"en": "Electronics", "es": "Electrónicos", "de": "Elektronik"}
    },
    {
      "value": "Electronics/Computers",
      "labels": {"en": "Computers", "es": "Computadoras", "de": "Computer"}
    }
  ]
}

Search with Language

POST /api/v1/collections/products/search
{
  "q": "laptop",
  "facets": ["category", "brand"],
  "language": "es"
}

Response includes translated labels:

{
  "facets": {
    "brand": [
      {"value": "apple", "label": "Apple Inc.", "count": 50}
    ]
  },
  "hierarchy_facets": {
    "category": [
      {"path": "Electronics", "label": "Electrónicos", "count": 100}
    ]
  }
}

Language Fallback

Labels are resolved in the following order, falling through to the next step when a translation is missing:

  1. Requested language (e.g., "es")
  2. English ("en")
  3. Raw field value
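
The fallback order can be expressed as a short lookup. This is a minimal sketch of the three steps above, with `resolve_label` as a hypothetical helper name.

```python
def resolve_label(value, labels, language):
    """Requested language first, then English, then the raw field value."""
    for candidate in (language, "en"):
        if candidate in labels:
            return labels[candidate]
    return value


labels = {"en": "Electronics", "es": "Electrónicos", "de": "Elektronik"}
spanish = resolve_label("Electronics", labels, "es")
french = resolve_label("Electronics", labels, "fr")   # no "fr" label, falls back to "en"
untranslated = resolve_label("apple", {}, "es")        # no labels at all, raw value
```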

Sharding

Hash-based document sharding runs each search across shards in parallel. In the benchmarks below, 4 shards are 2.7-3x faster than an unsharded collection.

Create a Sharded Collection

POST /api/v1/collections
{
  "name": "products",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "brand", "type": "string", "facet": true},
    {"name": "price", "type": "float"}
  ],
  "num_shards": 4
}

How It Works

  1. Document Routing: Documents distributed via hash of document ID
  2. Parallel Search: Queries run on all shards concurrently
  3. Result Merging: K-way heap merge with global BM25 statistics
  4. Facet Aggregation: Counts summed across all shards
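
The four steps above can be sketched in a few functions. This is an illustration under stated assumptions, not the engine's code: CRC32 stands in for whatever hash OverclockDB actually uses, each shard is assumed to return hits already sorted by score, and the scores are assumed comparable across shards (which is what global BM25 statistics provide). All function names are hypothetical.

```python
import heapq
import zlib
from collections import Counter
from itertools import islice


def shard_for(doc_id, num_shards):
    """Step 1: route a document by a hash of its ID (CRC32 is an assumption)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def merge_top_k(shard_hits, k):
    """Step 3: k-way merge of per-shard (score, doc_id) lists sorted by score descending."""
    merged = heapq.merge(*shard_hits, key=lambda hit: -hit[0])
    return list(islice(merged, k))


def merge_facets(shard_counts):
    """Step 4: sum facet counts across shards."""
    total = Counter()
    for counts in shard_counts:
        total.update(counts)
    return dict(total)


top = merge_top_k(
    [[(0.9, "a"), (0.4, "b")], [(0.7, "c"), (0.1, "d")]],
    k=3,
)
brands = merge_facets([{"apple": 3, "dell": 1}, {"apple": 2}])
```

Step 2 (running the query on all shards concurrently) is omitted here; the merge only needs each shard's sorted result list.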

Performance

Collection Size   Regular   4 Shards   Speedup
50K documents     949 µs    352 µs     2.7x faster
100K documents    1.17 ms   386 µs     3x faster

Recommendations

Collection Size   Recommended Shards
< 50K docs        1 (no sharding)
50K - 500K        4 shards
500K - 2M         8 shards
> 2M docs         8-16 shards

Aggregation Framework

A multi-source data merging system with computed fields, pattern expansion, and priority-based selection. It is intended for use cases such as B2B pricing, inventory aggregation, and rating computation.

Overview

The Aggregation Framework enables:

  • Multi-source loading - Load data from multiple collections with priority-based fallback
  • Pattern expansion - Dynamic collection names using context variables (e.g., prices_customer_{customer_id})
  • Priority merging - Select best value using configurable strategies
  • Computed fields - Calculate derived fields using expressions
  • Search integration - Filter and sort by aggregated/computed fields

Aggregation Config API

# Create an aggregation config
POST /api/v1/aggregations
{
  "name": "b2b_pricing",
  "merge_key": "product_id",
  "sources": [...],
  "priority_strategy": {...},
  "computed_fields": [...]
}

# List all configs
GET /api/v1/aggregations

# Get a specific config
GET /api/v1/aggregations/:name

# Update a config
PUT /api/v1/aggregations/:name

# Delete a config
DELETE /api/v1/aggregations/:name

Complete B2B Pricing Example

This example implements customer-specific pricing with group fallbacks and computed discounts.

Step 1: Create Collections

# Products collection (base data)
POST /api/v1/collections
{
  "name": "products",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "category", "type": "hierarchy", "facet": true},
    {"name": "brand", "type": "string", "facet": true},
    {"name": "msrp", "type": "float"}
  ]
}

# Customer-specific prices (merge-enabled)
POST /api/v1/collections
{
  "name": "prices_customer_vip123",
  "fields": [
    {"name": "price", "type": "float", "merge": true, "sort": true},
    {"name": "discount_percent", "type": "float", "merge": true},
    {"name": "allow_discount", "type": "bool", "merge": true}
  ]
}

# Group prices (wholesale, retail, etc.)
POST /api/v1/collections
{
  "name": "prices_group_wholesale",
  "fields": [
    {"name": "price", "type": "float", "merge": true, "sort": true},
    {"name": "discount_percent", "type": "float", "merge": true},
    {"name": "allow_discount", "type": "bool", "merge": true}
  ]
}

# Default prices (fallback)
POST /api/v1/collections
{
  "name": "prices_default",
  "fields": [
    {"name": "price", "type": "float", "merge": true, "sort": true},
    {"name": "discount_percent", "type": "float", "merge": true},
    {"name": "allow_discount", "type": "bool", "merge": true}
  ]
}

Step 2: Add Sample Data

# Add products
POST /api/v1/collections/products/docs/batch
{
  "documents": [
    {"id": "laptop_1", "title": "Gaming Laptop Pro", "category": "Electronics/Computers", "brand": "TechBrand", "msrp": 1500.00},
    {"id": "laptop_2", "title": "Business Laptop", "category": "Electronics/Computers", "brand": "WorkPro", "msrp": 1200.00}
  ]
}

# Customer-specific prices (VIP gets 20% discount)
POST /api/v1/collections/prices_customer_vip123/docs/batch
{
  "documents": [
    {"id": "laptop_1", "price": 1200.00, "discount_percent": 20, "allow_discount": true}
  ]
}

# Wholesale group prices
POST /api/v1/collections/prices_group_wholesale/docs/batch
{
  "documents": [
    {"id": "laptop_1", "price": 1350.00, "discount_percent": 10, "allow_discount": true},
    {"id": "laptop_2", "price": 1080.00, "discount_percent": 10, "allow_discount": true}
  ]
}

# Default prices
POST /api/v1/collections/prices_default/docs/batch
{
  "documents": [
    {"id": "laptop_1", "price": 1500.00, "discount_percent": 0, "allow_discount": false},
    {"id": "laptop_2", "price": 1200.00, "discount_percent": 0, "allow_discount": false}
  ]
}

Step 3: Create Aggregation Config

POST /api/v1/aggregations
{
  "name": "b2b_pricing",
  "merge_key": "product_id",
  "sources": [
    {
      "pattern": "prices_customer_{customer_id}",
      "priority": 1,
      "exact": true
    },
    {
      "collection": "prices_group_wholesale",
      "priority": 2
    },
    {
      "collection": "prices_default",
      "priority": 3
    }
  ],
  "priority_strategy": {
    "type": "by_priority",
    "prefer_exact": true
  },
  "computed_fields": [
    {
      "name": "final_price",
      "expression": "if(allow_discount, price * (1 - discount_percent / 100), price)"
    },
    {
      "name": "savings",
      "expression": "price - final_price"
    }
  ]
}

Step 4: Search with Pricing Context

POST /api/v1/collections/products/search
{
  "q": "laptop",
  "query_by": ["title"],
  "aggregation": {
    "config_name": "b2b_pricing",
    "context": {
      "customer_id": "vip123"
    }
  },
  "filter": "final_price:<1400",
  "sort_by": "final_price:asc",
  "facets": ["category", "brand"],
  "limit": 10
}

Response

{
  "found": 2,
  "took_ms": 8,
  "hits": [
    {
      "id": "laptop_1",
      "title": "Gaming Laptop Pro",
      "category": "Electronics/Computers",
      "brand": "TechBrand",
      "score": 0.95,
      "price": 1200.00,
      "discount_percent": 20,
      "allow_discount": true,
      "final_price": 960.00,
      "savings": 240.00
    },
    {
      "id": "laptop_2",
      "title": "Business Laptop",
      "category": "Electronics/Computers",
      "brand": "WorkPro",
      "score": 0.85,
      "price": 1080.00,
      "discount_percent": 10,
      "allow_discount": true,
      "final_price": 972.00,
      "savings": 108.00
    }
  ],
  "facets": {
    "category": [{"value": "Electronics/Computers", "count": 2}],
    "brand": [{"value": "TechBrand", "count": 1}, {"value": "WorkPro", "count": 1}]
  }
}
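
The computed fields in this response can be checked by hand. The sketch below applies the two expressions from the config (final_price, then savings) to the merged price records. `apply_computed_fields` is a hypothetical helper written for this check, and results are rounded to cents for display.

```python
def apply_computed_fields(record):
    """Evaluate final_price and savings as defined in the b2b_pricing config."""
    out = dict(record)
    if record["allow_discount"]:
        final_price = record["price"] * (1 - record["discount_percent"] / 100)
    else:
        final_price = record["price"]
    out["final_price"] = round(final_price, 2)
    out["savings"] = round(record["price"] - final_price, 2)
    return out


laptop_1 = apply_computed_fields(
    {"price": 1200.00, "discount_percent": 20, "allow_discount": True}
)
laptop_2 = apply_computed_fields(
    {"price": 1080.00, "discount_percent": 10, "allow_discount": True}
)
```

For laptop_1 this gives 1200 * 0.8 = 960 with savings of 240, and for laptop_2 it gives 1080 * 0.9 = 972 with savings of 108, matching the hits above.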

Config Reference

{
  "name": "config_name",
  "merge_key": "product_id",
  "sources": [
    {
      "collection": "static_collection_name",
      "pattern": "dynamic_collection_{variable}",
      "priority": 1,
      "exact": false,
      "shard_by": "shard_field",
      "fields": { "source_field": "target_field" }
    }
  ],
  "priority_strategy": {
    "type": "by_priority",
    "prefer_exact": true
  },
  "computed_fields": [
    {
      "name": "computed_field_name",
      "expression": "price * (1 - discount / 100)"
    }
  ]
}

Source Options

Field        Type      Description
collection   string    Static collection name (mutually exclusive with pattern)
pattern      string    Dynamic collection pattern with {variable} placeholders
priority     number    Priority level (lower = higher priority, default: 0)
exact        boolean   Mark as "exact" match for prefer_exact strategies
shard_by     string    Field to shard by for shard-keyed collections
fields       object    Field name mappings: {"source": "target"}

Priority Strategies

Strategy      JSON                                            Description
By Priority   {"type": "by_priority", "prefer_exact": true}   Lower priority number wins. If prefer_exact, exact sources preferred.
Min Value     {"type": "min_value", "field": "price"}         Select record with minimum value of specified field
Max Value     {"type": "max_value", "field": "rating"}        Select record with maximum value of specified field
First Match   {"type": "first_match"}                         Take first match by priority order
All           {"type": "all"}                                 Return all matches (no merging)
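
The strategies can be modelled as a selection over the records found for one merge key. The sketch below is one possible reading of the table, not the engine's code. In particular it assumes that with prefer_exact an exact source beats any non-exact source regardless of priority, and that every candidate carries the comparison field. The `select` function and the candidate shape are hypothetical.

```python
def select(candidates, strategy):
    """candidates: dicts with "priority", "exact" and "record" keys."""
    kind = strategy["type"]
    if kind == "all":
        return [c["record"] for c in candidates]
    if not candidates:
        return None
    if kind in ("by_priority", "first_match"):
        prefer_exact = kind == "by_priority" and strategy.get("prefer_exact", False)

        def rank(candidate):
            # Assumption: exact sources sort first, then the lower priority number wins.
            not_exact = prefer_exact and not candidate.get("exact", False)
            return (not_exact, candidate.get("priority", 0))

        return min(candidates, key=rank)["record"]
    if kind in ("min_value", "max_value"):
        field = strategy["field"]
        pick = min if kind == "min_value" else max
        return pick(candidates, key=lambda c: c["record"][field])["record"]
    raise ValueError(f"unknown strategy type: {kind!r}")


candidates = [
    {"priority": 2, "exact": False, "record": {"price": 90.0}},
    {"priority": 1, "exact": False, "record": {"price": 120.0}},
    {"priority": 3, "exact": True, "record": {"price": 100.0}},
]
by_priority = select(candidates, {"type": "by_priority"})
exact_first = select(candidates, {"type": "by_priority", "prefer_exact": True})
cheapest = select(candidates, {"type": "min_value", "field": "price"})
```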

Expression Language

Computed fields use a simple expression language for calculations.

Arithmetic Operators

price + tax                    # Addition
price - discount               # Subtraction
price * quantity               # Multiplication
total / count                  # Division

Comparison Operators

price > 100                    # Greater than
price >= 100                   # Greater than or equal
price < 100                    # Less than
price <= 100                   # Less than or equal
status == "active"             # Equal
status != "inactive"           # Not equal

Logical Operators

in_stock && price < 100        # AND
is_sale || is_clearance        # OR
!is_discontinued               # NOT

Conditional Expression

if(condition, then_value, else_value)

# Examples:
if(allow_discount, price * 0.9, price)
if(quantity > 10, price * 0.95, price)
if(is_member && total > 100, total * 0.85, total)

Built-in Functions

min(a, b)                      # Minimum of two values
max(a, b)                      # Maximum of two values
min(price, sale_price, promo)  # Minimum of multiple values
max(rating1, rating2, rating3) # Maximum of multiple values
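
To experiment with expressions outside the server, the operators above can be mapped onto Python. The sketch below is a toy translation that relies on `eval`, so it is only suitable for trusted input, and it is not OverclockDB's parser. It evaluates both branches of if() eagerly and would also rewrite operators that appear inside string literals. The `evaluate` function is a hypothetical name.

```python
import re


def evaluate(expression, record):
    """Translate an expression to Python syntax and evaluate it against a record."""
    py = expression.replace("&&", " and ").replace("||", " or ")
    py = re.sub(r"!(?!=)", " not ", py)      # logical NOT, but leave != alone
    py = re.sub(r"\bif\s*\(", "_if(", py)    # "if" is a Python keyword
    functions = {
        "__builtins__": {},                   # hide Python builtins from the expression
        "_if": lambda condition, then_value, else_value: then_value if condition else else_value,
        "min": min,
        "max": max,
    }
    return eval(py.strip(), functions, dict(record))


discounted = evaluate(
    "if(allow_discount, price * (1 - discount_percent / 100), price)",
    {"allow_discount": True, "price": 1200.0, "discount_percent": 20},
)
cheap_and_available = evaluate("in_stock && price < 100", {"in_stock": True, "price": 50})
still_sold = evaluate("!is_discontinued", {"is_discontinued": False})
```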

Pattern Expansion

Dynamic collection names are resolved at query time using context variables.

Simple Variable Expansion

Pattern: "prices_customer_{customer_id}"
Context: {"customer_id": "vip123"}
Result:  "prices_customer_vip123"
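
The expansion itself is a placeholder substitution. The sketch below shows one way to do it. Raising an error for a missing variable is an assumption; the source does not say how OverclockDB handles that case. `expand_pattern` is a hypothetical helper.

```python
import re


def expand_pattern(pattern, context):
    """Replace each {variable} placeholder with its value from the context."""
    def substitute(match):
        name = match.group(1)
        if name not in context:
            # Assumption: a missing variable is treated as an error.
            raise KeyError(f"missing context variable: {name}")
        return str(context[name])

    return re.sub(r"\{(\w+)\}", substitute, pattern)


collection = expand_pattern("prices_customer_{customer_id}", {"customer_id": "vip123"})
```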

Querying Multiple Shard Values

For shard-keyed collections, use the shard_values parameter directly:

POST /api/v1/collections/prices/search
{
  "q": "*",
  "shard_values": ["wholesale", "retail", "vip"]
}

Use Case Examples

Inventory Aggregation

{
  "name": "inventory_aggregation",
  "merge_key": "sku",
  "sources": [
    { "pattern": "inventory_warehouse_{warehouse_id}", "priority": 1 },
    { "collection": "inventory_central", "priority": 2 }
  ],
  "priority_strategy": { "type": "max_value", "field": "quantity" },
  "computed_fields": [
    { "name": "in_stock", "expression": "quantity > 0" },
    { "name": "low_stock", "expression": "quantity > 0 && quantity < 10" }
  ]
}

Rating Aggregation

{
  "name": "product_ratings",
  "merge_key": "product_id",
  "sources": [
    { "collection": "internal_reviews", "priority": 1, "fields": {"rating": "internal_rating", "count": "internal_count"} },
    { "collection": "external_reviews", "priority": 2, "fields": {"rating": "external_rating", "count": "external_count"} }
  ],
  "priority_strategy": { "type": "all" },
  "computed_fields": [
    { "name": "avg_rating", "expression": "(internal_rating + external_rating) / 2" },
    { "name": "total_reviews", "expression": "internal_count + external_count" }
  ]
}

Performance

Metric                         Value
Products tested                40K+
Source collections per query   10+
Query time (p99)               <50ms for top 100 results
Pattern expansion              ~500ns per variable
Expression evaluation          ~200ns per expression

Collection Merge

Join separate collections at query time (similar to SQL JOIN). Useful for context-specific pricing, customer overrides, or any scenario requiring data from multiple collections merged into search results.

Note: For complex multi-source scenarios with computed fields, use the Aggregation Framework instead.

Create Merge-Enabled Collection

Set merge: true on fields to enable merge indexing:

POST /api/v1/collections
{
  "name": "prices_store_123",
  "fields": [
    {"name": "price", "type": "float", "merge": true, "sort": true},
    {"name": "discount", "type": "float", "merge": true}
  ]
}

# Add prices (id = product ID from base collection)
POST /api/v1/collections/prices_store_123/docs/batch
{
  "documents": [
    {"id": "prod_456", "price": 99.99, "discount": 10},
    {"id": "prod_789", "price": 149.99, "discount": 0}
  ]
}

Search with Merge

POST /api/v1/collections/products/search
{
  "q": "laptop",
  "merge": {
    "collections": ["prices_customer_vip", "prices_store_123"],
    "priority_collection": "prices_customer_vip",
    "comparison_field": "price",
    "strategy": "min",
    "return_fields": ["price", "discount"]
  },
  "sort_by": "price:asc",
  "limit": 10
}

Merge Configuration

Parameter             Type       Description
collections           string[]   Collections to merge into results (required)
priority_collection   string     If this collection has a value, use it directly
comparison_field      string     Field for min/max strategy (default: first merge field)
strategy              string     How to combine values: "min" or "max" (default: "min")
return_fields         string[]   Which fields to include in response

Resolution Logic

  1. Query all merge collections in parallel for matching document IDs
  2. If priority_collection is set AND has a value → use it directly
  3. Otherwise → apply strategy (min/max) across all collections
  4. Documents without ANY matching merge entry → excluded from results
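
The resolution steps above can be sketched for a single document as follows. This models the described logic on in-memory dictionaries and is not the engine's implementation. `resolve_merge` is a hypothetical helper, and it requires comparison_field to be given explicitly rather than defaulting to the first merge field.

```python
def resolve_merge(doc_id, collections, config):
    """collections maps name -> {doc_id: record}; returns merged fields or None."""
    # Step 1: look the document up in every merge collection.
    found = {
        name: collections[name][doc_id]
        for name in config["collections"]
        if doc_id in collections.get(name, {})
    }
    if not found:
        return None  # Step 4: no merge entry, so the document is excluded.
    priority = config.get("priority_collection")
    if priority in found:
        chosen = found[priority]  # Step 2: the priority collection wins outright.
    else:
        # Step 3: apply the min/max strategy on the comparison field.
        pick = max if config.get("strategy", "min") == "max" else min
        field = config["comparison_field"]
        chosen = pick(found.values(), key=lambda record: record[field])
    wanted = config.get("return_fields")
    return {k: v for k, v in chosen.items() if wanted is None or k in wanted}


collections = {
    "prices_customer_vip": {"prod_456": {"price": 89.99, "discount": 10}},
    "prices_store_123": {
        "prod_456": {"price": 99.99, "discount": 10},
        "prod_789": {"price": 149.99, "discount": 0},
    },
}
config = {
    "collections": ["prices_customer_vip", "prices_store_123"],
    "priority_collection": "prices_customer_vip",
    "comparison_field": "price",
    "strategy": "min",
    "return_fields": ["price", "discount"],
}
vip_price = resolve_merge("prod_456", collections, config)
store_price = resolve_merge("prod_789", collections, config)
missing = resolve_merge("prod_000", collections, config)
```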

Response Format

Collection merge uses a flat response format with all fields at the root level:

{
  "found": 150,
  "took_ms": 3,
  "hits": [
    {
      "id": "prod_456",
      "title": "Gaming Laptop",
      "category": "Electronics",
      "score": 0.95,
      "price": 89.99,
      "discount": 10
    }
  ]
}

Note: Reserved fields (id, score, text_score, vector_score) cannot be overwritten by merge fields.

Performance

  • Indexed sorting: O(offset + limit) for single-collection sorted merge queries
  • Parallel lookups: Multiple merge collections queried concurrently
  • Benchmark: 170x-11,700x faster than manual sort for sorted iteration

Attribute Facets

Dynamic key-value faceting for flexible product attributes. Unlike regular facets, attribute facets support arbitrary keys that vary per document.

Schema Definition

POST /api/v1/collections
{
  "name": "products",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "specs", "type": "attributes", "facet": true}
  ]
}

Document Examples

// Laptop with CPU, RAM, Storage specs
{
  "id": "laptop-1",
  "title": "MacBook Pro 16",
  "specs": {
    "cpu": "M3 Pro",
    "ram": "18GB",
    "storage": "512GB SSD"
  }
}

// Phone with different specs
{
  "id": "phone-1",
  "title": "iPhone 15 Pro",
  "specs": {
    "cpu": "A17 Pro",
    "storage": "256GB",
    "display": "6.1 inch"
  }
}

Search with Attribute Facets

POST /api/v1/collections/products/search
{
  "q": "*",
  "facets": ["specs"],
  "max_attribute_types": 10,
  "max_attribute_values": 5
}

Response Format

{
  "found": 100,
  "hits": [...],
  "attribute_facets": {
    "specs": {
      "types": ["cpu", "ram", "storage", "display"],
      "values": {
        "cpu": [
          {"value": "M3 Pro", "count": 15},
          {"value": "A17 Pro", "count": 12},
          {"value": "i7-13700", "count": 8}
        ],
        "ram": [
          {"value": "18GB", "count": 20},
          {"value": "16GB", "count": 18}
        ],
        "storage": [
          {"value": "512GB SSD", "count": 25},
          {"value": "256GB", "count": 22}
        ]
      }
    }
  }
}
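
The response shape can be reproduced by counting attribute keys and their values across the matching documents. The sketch below assumes that max_attribute_types limits the number of keys, that max_attribute_values limits the values listed per key, and that both are ordered by document count. The source does not spell out these semantics, and `attribute_facets` is a hypothetical helper.

```python
from collections import Counter, defaultdict


def attribute_facets(documents, field, max_types=10, max_values=5):
    """Count attribute keys and their values across documents."""
    type_counts = Counter()
    value_counts = defaultdict(Counter)
    for doc in documents:
        for key, value in doc.get(field, {}).items():
            type_counts[key] += 1
            value_counts[key][value] += 1
    types = [key for key, _ in type_counts.most_common(max_types)]
    return {
        "types": types,
        "values": {
            key: [
                {"value": value, "count": count}
                for value, count in value_counts[key].most_common(max_values)
            ]
            for key in types
        },
    }


docs = [
    {"id": "laptop-1", "specs": {"cpu": "M3 Pro", "ram": "18GB", "storage": "512GB SSD"}},
    {"id": "phone-1", "specs": {"cpu": "A17 Pro", "storage": "256GB", "display": "6.1 inch"}},
    {"id": "laptop-2", "specs": {"cpu": "M3 Pro", "ram": "18GB"}},
]
result = attribute_facets(docs, "specs", max_types=10, max_values=5)
```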

Filtering on Attributes

# Filter by specific attribute value
"filter": "specs.cpu:=M3 Pro"

# Combine multiple attribute filters
"filter": "specs.cpu:=M3 Pro AND specs.ram:=18GB"

Shard-Keyed Collections

Route documents to shards based on a field value (e.g., customer_id) instead of document ID. This enables single-shard queries when the shard key is known.

Create Shard-Keyed Collection

POST /api/v1/collections
{
  "name": "prices",
  "fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "product_id", "type": "string"},
    {"name": "price", "type": "float", "merge": true}
  ],
  "shard_config": {
    "shard_key": "customer_id",
    "num_shards": 8
  }
}

How It Works

  1. Document routing: Documents are distributed to shards based on hash of the shard_key field value
  2. Single-shard queries: When shard key is provided, query hits only one shard (O(1) routing)
  3. Scatter-gather: Without shard key, query searches all shards in parallel
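
The routing decision can be sketched as follows. CRC32 stands in for the real hash function, which the source does not name, and both helper names are hypothetical.

```python
import zlib


def shard_for_key(value, num_shards):
    """Route by a hash of the shard key value (CRC32 is an assumption)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


def shards_to_query(equality_filters, shard_key, num_shards):
    """Single shard when the shard key is pinned, otherwise scatter-gather."""
    if shard_key in equality_filters:
        return [shard_for_key(equality_filters[shard_key], num_shards)]
    return list(range(num_shards))


fast_path = shards_to_query({"customer_id": "vip123"}, "customer_id", 8)
scatter = shards_to_query({"product_id": "laptop_1"}, "customer_id", 8)
```

Because the same function is used when writing documents, every document for "vip123" lands on the shard that the fast path later queries.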

Query with Shard Key

# Fast path - single shard lookup
POST /api/v1/collections/prices/search
{
  "q": "*",
  "filter": "customer_id:=vip123",
  "limit": 100
}
# Routes directly to shard containing customer "vip123"

Use Cases

  • Per-customer pricing: Shard by customer_id for fast customer-specific price lookups
  • Multi-tenant data: Shard by tenant_id for data isolation
  • Geographic data: Shard by region for localized queries

vs. Regular Sharding

Feature                Regular Sharding      Shard-Keyed
Routing basis          Document ID hash      Field value hash
Single-shard queries   By document ID only   By shard key value
Best for               Even distribution     Key-based access patterns

Design Considerations

When managing data for many contexts (e.g., thousands of customers), you have two architectural choices:

Option 1: Separate Collections per Context

prices_customer_001
prices_customer_002
prices_customer_003
... (thousands of collections)

  • Pros: Complete physical isolation, independent retention policies, different schemas per customer
  • Cons: Management overhead (thousands of collections), memory overhead per collection (indexes, metadata)

Option 2: Shard-Keyed Single Collection

prices (shard_key: customer_id, num_shards: 8)
└── Contains all customer data, partitioned by hash

  • Pros: Single collection to manage, efficient memory usage, cross-customer analytics possible
  • Cons: Data mixed at shard level (but filtered at query time)

Why Data Mixing is Acceptable

When you query with filter: "customer_id:=vip123":

  1. System calculates hash("vip123") % 8 → routes to single shard
  2. The customer_id index on that shard instantly narrows results to ~100 docs (not 100,000)
  3. Other customers' data on the same shard is never returned - filtered out by the index

Result: Query-time isolation with the efficiency of a single collection.

When to Use Each Approach

Use Case                                   Recommendation
Same schema for all contexts               Shard-keyed single collection
Compliance requires physical isolation     Separate collections
Different retention policies per context   Separate collections
Need cross-context analytics               Shard-keyed single collection
Thousands of contexts                      Shard-keyed single collection
Contexts have different schemas            Separate collections